Abstract

In this paper, we study the problem of learning from weakly labeled data (or weak-label 
learning), where labels of the training examples are incomplete. This includes, for example, 
(i) semi-supervised learning where labels are partially known; (ii) multi-instance learning 
where labels are implicitly known; and (iii) clustering where labels are completely unknown. 
Unlike supervised learning, learning with weak labels involves a difficult Mixed-Integer 
Programming (MIP) problem. Therefore, it can suffer from poor scalability and may also
get stuck in local minima. In this paper, we focus on SVMs and propose the WellSVM
via a novel label generation strategy. This leads to a convex relaxation of the original MIP, 
which is at least as tight as existing convex Semi-Definite Programming (SDP) relaxations. 
Moreover, the WellSVM can be solved via a sequence of SVM subproblems that are 
much more scalable than previous convex SDP relaxations. Experiments on three weakly 
labeled learning tasks, namely, (i) semi-supervised learning; (ii) multi-instance learning 
for locating regions of interest in content-based information retrieval; and (iii) clustering, 
clearly demonstrate improved performance, and WellSVM is also readily applicable to
large data sets. 

Keywords: semi-supervised learning, multi-instance learning, clustering, cutting plane, 
convex relaxation 



1. Introduction 

Obtaining labeled data is expensive and difficult. For example, in scientific applications, 
obtaining the labels involves repeated experiments that may be hazardous; in drug prediction,
deriving active molecules of a new drug involves expensive expertise that may not
even be available. On the other hand, weakly labeled data, where the labels are incomplete,
are ubiquitous in many applications. Therefore, exploiting weakly labeled training
data may help improve performance and discover the underlying structure of the data.
Indeed, this has been regarded as one of the most challenging tasks in machine learning
research (Mitchell, 2006).

Many weak-label learning problems have been proposed. In the following, we summarize 
several major learning paradigms with weakly labeled data: 

• Labels are partially known. A representative example is semi-supervised learning
(SSL) (Chapelle et al., 2006b; Zhu, 2006), where most of the training examples are
unlabeled and only a few are labeled. SSL improves generalization performance by
using the unlabeled examples that are often abundant. In the past decade, SSL has
attracted much attention and achieved successful results in diverse applications such
as text categorization, image retrieval, and medical diagnosis.



• Labels are implicitly known. Multi-instance learning (MIL) (Dietterich et al., 1997) is
the most prominent example in this category. In MIL, training examples are called
bags, each of which contains multiple instances. Many real-world objects can be naturally
described by multiple instances. For example, an image (bag) usually contains
multiple semantic regions, and each region is an instance. Instead of describing an object
as a single instance, the multi-instance representation can help separate different
semantics within the object. MIL has been successfully applied to diverse domains
such as image classification, text categorization, and web mining. The relationship
between multi-instance learning and semi-supervised learning has also been discussed
in (Zhou and Xu, 2007).

In traditional MIL, a bag is labeled positive when it contains at least one positive
instance, and is labeled negative otherwise. Although the bag labels are often available,
the instance labels are only implicitly known. It is worth noting that identification
of the key (or positive) instances from the positive bags can be very useful in many
real-world applications. For example, in content-based information retrieval (CBIR),
the explicit identification of regions of interest (ROI) can help the user quickly recognize
the images that he/she wants (especially when the system returns a large number
of images). Similarly, to detect suspect areas in some medical and military applications,
a quick scanning of a huge number of images is required. Again, it is very
desirable if ROIs can be identified. Besides providing an accurate and efficient prediction,
the identification of key instances is also useful in understanding ambiguous
objects (Li et al., 2012).



• Labels are totally unknown. This becomes unsupervised learning (Jain and Dubes,
1988), which aims at discovering the underlying structure (or concepts/labels) of the
data and grouping similar examples together. Clustering is valuable in data analysis,
and is widely used in various domains including information retrieval, computer
vision, and bioinformatics.

There are other kinds of weak-label learning problems. For instance, Angluin and
Laird (1988) and references therein studied noise-tolerant problems where the label
information is noisy; Sheng et al. (2008) and references therein considered learning
from multiple annotation results by different experts in which all the experts are
imperfect; Sun et al. (2010) and Bucak et al. (2011) considered weakly labeled data
in the context of multi-label learning.

Unlike supervised learning where the training labels are complete, weak-label learning
needs to infer the integer-valued labels of the training examples, resulting in a difficult
mixed-integer programming (MIP) problem. To solve this problem, many algorithms have been
proposed, including global optimization (Chapelle et al., 2008; Sindhwani et al., 2006) and
convex SDP relaxations (Xu et al., 2005; Xu and Schuurmans, 2005; De Bie and Cristianini,
2006; Guo, 2009). Empirical studies have demonstrated their promising performance on
small data sets. Although SDP convex relaxations can reduce the training time complexity
of global optimization methods from exponential to polynomial, they still cannot handle
medium-sized data sets having thousands of examples. Recently, several algorithms
resort to non-convex optimization techniques, such as alternating optimization methods
(Andrews et al., 2003; Zhang et al., 2007; Li et al., 2009b) and the constrained
convex-concave procedure (Collobert et al., 2006; Cheung and Kwok, 2006; Zhao et al., 2008).

Although these approaches are often efficient, they can only obtain locally optimal so- 
lutions and can easily get stuck in local minima. Therefore, it is desirable to develop a 
scalable yet convex optimization method for learning with large-scale weakly labeled data. 



[...] 2009; Zhang et al., 2009a; Vapnik, [...] we are more interested in inductive learning methods.

In this paper, we will focus on the binary support vector machine (SVM). Extending
our preliminary works in (Li et al., 2009a,c), we propose a convex weakly labeled SVM
(denoted WellSVM (WEakly LabeLed SVM)) via a novel "label generation" strategy.
Instead of obtaining a label relation matrix via SDP, WellSVM maximizes the margin
by generating the most violated label vectors iteratively, and then combines them via
efficient multiple kernel learning techniques. The whole procedure can be formulated as a
convex relaxation of the original MIP problem. Furthermore, it can be shown that the
learned linear combination of label vector outer-products is in the convex hull of the label
space. Since the convex hull is the smallest convex set containing the target non-convex
set (Boyd and Vandenberghe, 2004), our formulation is at least as tight as the convex SDP
relaxations proposed in (Xu et al., 2005; De Bie and Cristianini, 2006; Xu and Schuurmans,
2005). Moreover, WellSVM involves a series of SVM subproblems, which can be readily
solved in a scalable and efficient manner via state-of-the-art SVM software such as LIBSVM
(Fan et al., 2005), SVM-perf (Joachims, 2006), LIBLINEAR (Hsieh et al., 2008) and
CVM (Tsang et al., 2006). Therefore, WellSVM scales much better than existing SDP
approaches or even some non-convex approaches. Experiments on three common weak-label
learning tasks (semi-supervised learning, multi-instance learning, and clustering) validate
the effectiveness and scalability of the proposed WellSVM.

The rest of this paper is organized as follows. Section 2 briefly introduces large margin
weak-label learning. Section 3 presents the proposed WellSVM and analyzes its time
complexity. Section 4 presents detailed formulations on three weak-label learning problems.
Section 5 shows some comprehensive experimental results. The last section concludes this
work.

In the following, $M \succ 0$ (resp. $M \succeq 0$) denotes that the matrix $M$ is symmetric and
positive definite (pd) (resp. positive semidefinite (psd)). The transpose of a vector/matrix
(in both the input and feature spaces) is denoted by the superscript $'$, and $\mathbf{0}, \mathbf{1} \in \mathbb{R}^n$ denote
the zero vector and the vector of all ones, respectively. The inequality $v = [v_1, \dots, v_k]' \ge \mathbf{0}$
means that $v_i \ge 0$ for $i = 1, \dots, k$. Similarly, $M \ge 0$ means that all elements in the matrix
$M$ are nonnegative.



2. Large-Margin Weak-Label Learning 

We commence with a simpler supervised learning scenario. Given a set of labeled examples
$\mathcal{D} = \{x_i, y_i\}_{i=1}^N$ where $x_i \in \mathcal{X}$ is the input and $y_i \in \{\pm 1\}$ is the output, we aim to find
a decision function $f : \mathcal{X} \to \{\pm 1\}$ such that the following structural risk functional is
minimized:

\[ \min_{f} \; \Omega(f) + C\,\ell_f(\mathcal{D}). \tag{1} \]

Here, $\Omega$ is a regularizer related to large margin on $f$, $\ell_f(\mathcal{D})$ is the empirical loss on $\mathcal{D}$, and
$C$ is a regularization parameter that trades off the empirical risk and model complexity.
Both $\Omega$ and $\ell_f(\cdot)$ are problem-dependent. In particular, when $\ell_f(\cdot)$ is the hinge loss (or its
variants), the obtained $f$ is a large margin classifier. It is notable that both $\Omega$ and $\ell_f(\cdot)$
are usually convex. Thus, Eq. (1) is a convex problem whose globally optimal solution can
be efficiently obtained via various convex optimization techniques.

In weak-label learning, labels are not available on all $N$ training examples, and so also
need to be learned. Let $\hat{y} = [\hat{y}_1, \dots, \hat{y}_N]' \in \{\pm 1\}^N$ be the vector of (known and unknown)
labels on all the training examples. The basic idea of large-margin weak-label learning is
that the structural risk functional in Eq. (1) is minimized w.r.t. both the labeling $\hat{y}$ and
decision function $f$. Hence, Eq. (1) is extended to

\[ \min_{\hat{y} \in \mathcal{B}} \min_{f} \; \Omega(f) + C\,\ell_f(\{x_i, \hat{y}_i\}_{i=1}^N), \tag{2} \]

where $\mathcal{B}$ is a set of candidate label assignments obtained from some domain knowledge. For
example, when the positive and negative examples are known to be approximately balanced,
we can set $\mathcal{B} = \{\hat{y} : -\beta \le \sum_{i=1}^N \hat{y}_i \le \beta\}$ where $\beta$ is a small constant controlling the class
imbalance.



2.1 State-of-The-Art Approaches 



As Eq. (2) involves optimizing the integer variables $\hat{y}$, it is no longer a convex optimization
problem but a mixed-integer program. This can easily suffer from the local minimum
problem. Recently, much effort has been devoted to solving this problem. The proposed
approaches can be grouped into three categories. The first strategy optimizes Eq. (2) via
variants of non-convex optimization. Examples include alternating optimization
(Zhang et al., 2009b; Li et al., ...), in which we alternately optimize variable $\hat{y}$ (or $f$) by keeping
the other variable $f$ (or $\hat{y}$) constant; the constrained convex-concave procedure (CCCP) (also
known as DC programming) (Horst and Thoai, 1999; Zhao et al., 2008; Collobert et al.,
2006; Cheung and Kwok, 2006), in which the non-convex objective function or constraint is
decomposed as a difference of two convex functions; and local combinatorial search (Joachims,
1999), in which the labels of two examples in opposite classes are sequentially switched.
These approaches are often computationally efficient. However, since they are based on
non-convex optimization, they may inevitably get stuck in local minima.

1. To simplify notations, we write $\min_{\hat{y} \in \mathcal{B}}$, though indeed one only needs to minimize w.r.t. the unknown
labels in $\hat{y}$.

The second strategy obtains the globally optimal solution of Eq. (2) via global optimization.
Examples include branch-and-bound (Chapelle et al., 2008) and deterministic annealing
(Sindhwani et al., 2006). Since they aim at obtaining the globally optimal (instead
of the locally optimal) solution, excellent performance can be expected (Chapelle et al.,
2008). However, their worst-case computational costs can scale exponentially with the data
set size. Hence, these approaches can only be applied to small data sets with just hundreds
of training examples.

The third strategy is based on convex relaxations. The original non-convex problem is 
first relaxed to a convex problem, whose globally optimal solution can be efficiently obtained. 
This is then rounded to recover an approximate solution of the original problem. If the re- 
laxation is tight, the approximate solution obtained is close to the global optimum of the 
original problem and good performance can be expected. Moreover, the involved convex pro- 
gramming solver has a time complexity substantially lower than that for global optimization. 
A prominent example of convex relaxation is the use of semidefinite programming (SDP)
techniques (Xu et al., 2005; Xu and Schuurmans, 2005; De Bie and Cristianini, 2006; Guo,
2009), in which a positive semidefinite matrix is used to approximate the matrix of label
outer-products. The time complexity of this SDP-based strategy is $O(N^{6.5})$ (Zhang et al.,
2009b; Lobo et al., 1998; Nesterov and Nemirovskii, 1987), where $N$ is the data set size,
and can be further reduced to $O(N^{4.5})$ (Zhang et al., 2009b; Valizadegan and Jin, 2007).
However, this is still expensive for medium-sized data sets with several thousands of
examples.

To summarize, existing weak-label learning approaches are not scalable and can be 
sensitive to initialization. In this paper, we propose the WellSVM algorithm to address 
these two issues. 



3. WellSVM 

In this section, we first introduce the SVM dual which will be used as a basic reformulation
of our proposal, and then we present the general formulation of WellSVM. Detailed
formulations on three common weak-label learning tasks will be presented in Section 4.



3.1 Duals in Large Margin Classifiers 

In large margin classifiers, the inner minimization problem of Eq. (2) is often cast in the
dual form. For example, for the standard SVM without offset, we have $\Omega = \frac{1}{2}\|w\|^2$ and
$\ell_f(\mathcal{D})$ is the summed hinge loss. The inner minimization problem is then

\[ \min_{w, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i \quad \text{s.t.} \;\; \hat{y}_i w'\phi(x_i) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i = 1, \dots, N, \]

where $\phi(x_i)$ is the feature map induced by kernel $\kappa$, and its dual is

\[ \max_{\alpha} \; \alpha'\mathbf{1} - \frac{1}{2}\alpha'\big(K \odot \hat{y}\hat{y}'\big)\alpha \quad \text{s.t.} \;\; C\mathbf{1} \ge \alpha \ge \mathbf{0}, \]

where $\alpha \in \mathbb{R}^N$ is the dual variable, $K \in \mathbb{R}^{N \times N}$ is the kernel matrix defined on the $N$
samples, and $\odot$ is the element-wise product. For more details on the duals of large margin
classifiers, interested readers are referred to (Schölkopf and Smola, 2002; Cristianini et al.,
2002).
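The dual objective above can be evaluated directly once a kernel matrix and a candidate labeling are given. The following is a minimal sketch, not the authors' implementation: the function name `dual_objective`, the toy data, and the choice of a linear kernel are assumptions made for illustration. With a linear kernel, the quadratic term reduces to $\|\sum_i \alpha_i \hat{y}_i x_i\|^2$, which gives an independent way to check the element-wise product form.

```python
import numpy as np

def dual_objective(alpha, K, y):
    """G(alpha, y) = alpha'1 - 0.5 * alpha'(K ⊙ yy')alpha for the SVM without offset."""
    Q = K * np.outer(y, y)             # element-wise product K ⊙ yy'
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))            # toy inputs (illustrative only)
K = X @ X.T                            # linear kernel, so phi(x) = x
y = np.array([1, -1, 1, 1, -1, -1])    # one candidate label vector
alpha = rng.uniform(0.0, 1.0, size=6)  # feasible for the box constraint with C = 1

g = dual_objective(alpha, K, y)

# Same value computed through the primal weight vector w = sum_i alpha_i y_i x_i.
w = (alpha * y) @ X
g_from_w = alpha.sum() - 0.5 * w @ w
```

The two values agree because $\alpha'(K \odot \hat{y}\hat{y}')\alpha = (\alpha \odot \hat{y})'K(\alpha \odot \hat{y})$, which is also why a fixed labeling only rescales the dual variables by signs.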



In this paper, we make the following assumption on this dual. 

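Assumption 1 The dual of the inner minimization of Eq. (2) can be written as: $\max_{\alpha \in \mathcal{A}} G(\alpha, \hat{y})$,
where $\alpha = [\alpha_1, \dots, \alpha_N]'$ contains the dual variables and

• $\mathcal{A}$ is a convex set;

• $G(\alpha, \hat{y})$ is a concave function in $\alpha$ for any fixed $\hat{y}$;

• $g_{y}(\alpha) = -G(\alpha, \hat{y})|_{\hat{y} = y}$ is $\lambda$-strongly convex and $M$-Lipschitz. In other words, $\nabla^2 g_{y}(\alpha) - \lambda I \succeq 0$,
where $I$ is the identity matrix, and $\|g_{y}(\alpha) - g_{y}(\bar{\alpha})\| \le M\|\alpha - \bar{\alpha}\|$, $\forall y \in \mathcal{B}$, $\alpha, \bar{\alpha} \in \mathcal{A}$;

• $\forall \hat{y} \in \mathcal{B}$, $lb \le \max_{\alpha \in \mathcal{A}} G(\alpha, \hat{y}) \le ub$, where $lb$ and $ub$ are polynomial in $N$;

• $G(\alpha, \hat{y})$ can be rewritten as $\bar{G}(\alpha, M)$, where $M$ is a psd matrix, and $\bar{G}$ is concave in
$\alpha$ and linear in $M$.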

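With this assumption, Eq. (2) can be written as

\[ \min_{\hat{y} \in \mathcal{B}} \max_{\alpha \in \mathcal{A}} \; G(\alpha, \hat{y}). \tag{3} \]

Assume that the kernel matrix $K$ is pd (i.e., the smallest eigenvalue $\lambda_{\min} > 0$) and all
its entries are bounded ($|K_{ij}| \le \upsilon$ for some $\upsilon$). It is easy to see that the following SVM
variants satisfy Assumption 1.

• Standard SVM without offset: We have

\[ \mathcal{A} = \{\alpha \mid C\mathbf{1} \ge \alpha \ge \mathbf{0}\}, \]
\[ G(\alpha, \hat{y}) = \alpha'\mathbf{1} - \tfrac{1}{2}\alpha'\big(K \odot \hat{y}\hat{y}'\big)\alpha, \]
\[ \nabla^2 g_{y}(\alpha) = K \odot yy' \succeq \lambda_{\min}(I \odot yy') = \lambda_{\min} I, \]
\[ \|g_{y}(\alpha) - g_{y}(\bar{\alpha})\| \le (1 + C\upsilon N)\sqrt{N}\,\|\alpha - \bar{\alpha}\|, \]
\[ 0 \le \max_{\alpha \in \mathcal{A}} G(\alpha, y) \le CN, \]
\[ \bar{G}(\alpha, M_{y}) = \alpha'\mathbf{1} - \tfrac{1}{2}\alpha'\big(K \odot M_{y}\big)\alpha, \quad \text{where } M_{y} = yy'. \]

• $\nu$-SVM (Schölkopf and Smola, 2002): We have

\[ \mathcal{A} = \{\alpha \mid \alpha \ge \mathbf{0}, \; \alpha'\mathbf{1} = 1\}, \]
\[ G(\alpha, \hat{y}) = -\tfrac{1}{2}\alpha'\Big(\big(K + \tfrac{1}{C}I\big) \odot \hat{y}\hat{y}'\Big)\alpha, \]
\[ \nabla^2 g_{y}(\alpha) = \big(K + \tfrac{1}{C}I\big) \odot yy' \succeq \big(\lambda_{\min} + \tfrac{1}{C}\big)(I \odot yy') = \big(\lambda_{\min} + \tfrac{1}{C}\big) I, \]
\[ \|g_{y}(\alpha) - g_{y}(\bar{\alpha})\| \le \big(\upsilon + \tfrac{1}{C}\big)\sqrt{N}\,\|\alpha - \bar{\alpha}\|, \]
\[ -\tfrac{1}{2}\big(\upsilon + \tfrac{1}{C}\big) \le \max_{\alpha \in \mathcal{A}} G(\alpha, y) \le 0, \]
\[ \bar{G}(\alpha, M_{y}) = -\tfrac{1}{2}\alpha'\Big(\big(K + \tfrac{1}{C}I\big) \odot M_{y}\Big)\alpha. \]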



3.2 WellSVM 

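Interchanging the order of $\max_{\alpha \in \mathcal{A}}$ and $\min_{\hat{y} \in \mathcal{B}}$ in Eq. (3), we obtain the proposed WellSVM:

\[ \text{(WellSVM)} \quad \max_{\alpha \in \mathcal{A}} \min_{\hat{y} \in \mathcal{B}} \; G(\alpha, \hat{y}). \tag{4} \]

Using the minimax theorem (Kim and Boyd, 2008), the optimal objective of Eq. (3) upper-bounds
that of Eq. (4). Moreover, Eq. (4) can be transformed as

\[ \max_{\alpha \in \mathcal{A}} \Big\{ \max_{\theta} \; -\theta \quad \text{s.t.} \;\; \theta \ge -G(\alpha, \hat{y}_t), \;\; \forall \hat{y}_t \in \mathcal{B} \Big\}, \tag{5} \]

from which we obtain the following Proposition.

Proposition 1 The objective of WellSVM can be rewritten as the following optimization
problem:

\[ \min_{\mu \in \mathcal{M}} \max_{\alpha \in \mathcal{A}} \; \sum_{t : \hat{y}_t \in \mathcal{B}} \mu_t G(\alpha, \hat{y}_t), \tag{6} \]

where $\mu$ is the vector of $\mu_t$'s, $\mathcal{M}$ is the simplex $\{\mu \mid \sum_t \mu_t = 1, \; \mu_t \ge 0\}$, and $\hat{y}_t \in \mathcal{B}$.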

Proof For the inner optimization in Eq. (5), let $\mu_t \ge 0$ be the dual variable for each
constraint. Its Lagrangian can be obtained as

\[ -\theta + \sum_{t : \hat{y}_t \in \mathcal{B}} \mu_t \big(\theta + G(\alpha, \hat{y}_t)\big). \]

Setting the derivative w.r.t. $\theta$ to zero, we have $\sum_t \mu_t = 1$. We can then replace the inner
optimization subproblem with its dual and Eq. (5) becomes:

\[ \max_{\alpha \in \mathcal{A}} \min_{\mu \in \mathcal{M}} \sum_{t : \hat{y}_t \in \mathcal{B}} \mu_t G(\alpha, \hat{y}_t) = \min_{\mu \in \mathcal{M}} \max_{\alpha \in \mathcal{A}} \sum_{t : \hat{y}_t \in \mathcal{B}} \mu_t G(\alpha, \hat{y}_t). \]

Here, we use the fact that the objective function is convex in $\mu$ and concave in $\alpha$. ■

Recall that $G(\alpha, \hat{y})$ is concave in $\alpha$. Thus, the constraints in Eq. (5) are convex. It is
evident that the objective in Eq. (5) is linear in both $\alpha$ and $\theta$. Therefore, Eq. (5) is a convex
problem. In other words, WellSVM is a convex relaxation of Eq. (2).
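The step from the hard minimum over label vectors to a minimum over the simplex can be checked numerically for a fixed $\alpha$: a weighted average of the values $G(\alpha, \hat{y}_t)$ is never below their minimum, and the minimum is attained at a vertex of the simplex. The sketch below is illustrative only; it assumes the standard SVM dual without offset, takes $\mathcal{B}$ to be all of $\{\pm 1\}^N$ for a tiny $N$, and uses hypothetical names (`dual_objective`, `g_vals`).

```python
import itertools
import numpy as np

def dual_objective(alpha, K, y):
    # G(alpha, y) = alpha'1 - 0.5 * (alpha ⊙ y)' K (alpha ⊙ y)
    return alpha.sum() - 0.5 * (alpha * y) @ K @ (alpha * y)

rng = np.random.default_rng(1)
N = 4
X = rng.normal(size=(N, 2))
K = X @ X.T
alpha = rng.uniform(0.0, 1.0, size=N)       # any fixed feasible alpha

# Toy candidate set B: every label vector in {±1}^N (assumption for illustration).
labels = [np.array(y) for y in itertools.product([-1, 1], repeat=N)]
g_vals = np.array([dual_objective(alpha, K, y) for y in labels])
worst_case = g_vals.min()                   # min over y in B of G(alpha, y)

# Random points mu on the simplex: sum_t mu_t G(alpha, y_t) >= worst_case.
mus = rng.dirichlet(np.ones(len(labels)), size=1000)
mixed = mus @ g_vals

# The vertex that puts all weight on the worst label attains the minimum.
vertex = np.zeros(len(labels))
vertex[g_vals.argmin()] = 1.0
```

For each fixed $\alpha$ the two inner minima therefore coincide, which is what allows the minimization over $\hat{y}_t \in \mathcal{B}$ to be replaced by a minimization over $\mu \in \mathcal{M}$.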



3.3 Tighter than SDP Relaxations 

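In this section, we compare our minimax relaxation with SDP relaxations. It is notable that
the SVM without offset is always employed by previous SDP relaxations (Xu et al., 2005;
Xu and Schuurmans, 2005; De Bie and Cristianini, 2006). Recall the symbols in Section 3.1.
Define

\[ \mathcal{Y}_0 = \{M \mid M = M_{\hat{y}}, \; \hat{y} \in \mathcal{B}\}. \tag{7} \]

The original mixed-integer program in Eq. (3) is the same as

\[ \min_{M \in \mathcal{Y}_0} \max_{\alpha \in \mathcal{A}} \; \bar{G}(\alpha, M). \tag{8} \]

Define $\mathcal{Y}_1 = \{M \mid M = \sum_{t : \hat{y}_t \in \mathcal{B}} \mu_t M_{\hat{y}_t}, \; \mu \in \mathcal{M}\}$. Our minimax relaxation in Eq. (6) can
be written as

\[ \min_{\mu \in \mathcal{M}} \max_{\alpha \in \mathcal{A}} \sum_{t : \hat{y}_t \in \mathcal{B}} \mu_t \bar{G}(\alpha, M_{\hat{y}_t}) = \min_{\mu \in \mathcal{M}} \max_{\alpha \in \mathcal{A}} \bar{G}\Big(\alpha, \sum_{t : \hat{y}_t \in \mathcal{B}} \mu_t M_{\hat{y}_t}\Big) = \min_{M \in \mathcal{Y}_1} \max_{\alpha \in \mathcal{A}} \; \bar{G}(\alpha, M). \tag{9} \]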



On the other hand, the SDP relaxations in (Xu et al., 2005; Xu and Schuurmans, 2005;
De Bie and Cristianini, 2006) are of the form

\[ \min_{M \in \mathcal{Y}_2} \max_{\alpha \in \mathcal{A}} \; \bar{G}(\alpha, M), \tag{10} \]

where $\mathcal{Y}_2 = \{M \mid M \succeq 0, \; M \in \mathcal{M}_{\mathcal{B}}\}$, and $\mathcal{M}_{\mathcal{B}}$ is a convex set related to $\mathcal{B}$. For example,
in the context of clustering, Xu et al. (2005) used $\mathcal{B} = \{\hat{y} \mid -\beta \le \mathbf{1}'\hat{y} \le \beta\}$, where $\beta$ is a
parameter controlling the class imbalance, and $\mathcal{M}_{\mathcal{B}}$ is defined as

\[ \mathcal{M}_{\text{clustering}} = \Big\{ M = [m_{ij}] \;\Big|\; -1 \le m_{ij} \le 1; \; m_{ii} = 1, \; m_{ij} = m_{ji}, \; m_{ik} \ge m_{ij} + m_{jk} - 1, \; m_{jk} \ge -m_{ij} - m_{ik} - 1, \; -\beta \le \sum_{i=1}^N m_{ij} \le \beta, \; \forall i, j, k = 1, \dots, N \Big\}. \]

It is easy to verify that $\mathcal{Y}_0 \subseteq \mathcal{Y}_2$ and $\mathcal{Y}_2$ is convex. Similarly, in semi-supervised learning,
Xu and Schuurmans (2005) and De Bie and Cristianini (2006) defined $\mathcal{M}_{\mathcal{B}}$ as a subset of
$\mathcal{M}_{\text{clustering}}$ (footnote 2). Again, $\mathcal{Y}_0 \subseteq \mathcal{Y}_2$ and $\mathcal{Y}_2$ is convex.

2. For a more precise definition, interested readers are referred to (Xu and Schuurmans, 2005;
De Bie and Cristianini, 2006).

Algorithm 1 Cutting plane algorithm for WellSVM.
1: Initialize $\hat{y}$, $\mathcal{C} = \{\hat{y}\}$, and obtain the optimal $\alpha$ from Eq. (11).
2: repeat
3:   Update $\mathcal{C} \leftarrow \{\hat{y}\} \cup \mathcal{C}$.
4:   Obtain the optimal $\alpha$ from Eq. (11).
5:   Generate a violated $\hat{y}$.
6: until $G(\alpha, \hat{y}) > \min_{y \in \mathcal{C}} G(\alpha, y) - \epsilon$ (where $\epsilon$ is a small constant) or the decrease of
   objective value is smaller than a threshold.



Theorem 1 The relaxation of WellSVM is at least as tight as the SDP relaxations in
(Xu et al., 2005; Xu and Schuurmans, 2005; De Bie and Cristianini, 2006).

Proof Note that $\mathcal{Y}_1$ is the convex hull of $\mathcal{Y}_0$, i.e., the smallest convex set containing
$\mathcal{Y}_0$ (Boyd and Vandenberghe, 2004). Therefore, Eq. (9) gives the tightest convex relaxation
of Eq. (8), i.e., $\mathcal{Y}_1 \subseteq \mathcal{Y}_2$. In other words, our relaxation is at least as tight as SDP
relaxations. ■
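The containment used in the proof can be illustrated numerically: any convex combination of label outer-products $\hat{y}_t\hat{y}_t'$ is psd, has unit diagonal and entries in $[-1, 1]$, and satisfies the two triangle-type inequalities that appear in the clustering constraint set. The sketch below is only a sanity check under stated assumptions: the label vectors are drawn at random, the balance constraint involving $\beta$ is omitted, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 5, 4
Y = rng.choice([-1, 1], size=(T, N))        # T candidate label vectors (rows)
mu = rng.dirichlet(np.ones(T))              # weights on the simplex
M = np.einsum("t,ti,tj->ij", mu, Y, Y)      # M = sum_t mu_t * y_t y_t'

min_eig = np.linalg.eigvalsh(M).min()       # psd check (M is symmetric)

# tri_a[i, j, k] = m_ik - m_ij - m_jk + 1, must be >= 0
tri_a = M[:, None, :] - M[:, :, None] - M[None, :, :] + 1.0
# tri_b[i, j, k] = m_jk + m_ij + m_ik + 1, must be >= 0
tri_b = M[None, :, :] + M[:, :, None] + M[:, None, :] + 1.0
```

Each constraint checked here is either linear in $M$ or the psd cone constraint, and each holds for every single outer-product $\hat{y}\hat{y}'$, so it is inherited by every convex combination.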



3.4 Cutting Plane Algorithm by Label Generation 

It appears that existing convex optimization techniques can be readily used to solve the
convex problem in Eq. (4), or equivalently Eq. (5). However, note that there can be an
exponential number of constraints in Eq. (5), and so a direct optimization is computationally
intractable. Fortunately, typically not all these constraints are active at optimality, and
including only a subset of them can lead to a very good approximation of the original
optimization problem. Therefore, we can apply the cutting plane method (Kelley, 1960).

The cutting plane algorithm is described in Algorithm 1. First, we initialize a label
vector $\hat{y}$ and the working set $\mathcal{C}$ to $\{\hat{y}\}$, and obtain $\alpha$ from

\[ \min_{\mu \in \mathcal{M}} \max_{\alpha \in \mathcal{A}} \; \sum_{t : \hat{y}_t \in \mathcal{C}} \mu_t G(\alpha, \hat{y}_t) \tag{11} \]

via standard supervised learning methods. Then, a violated label vector $\hat{y}$ in Eq. (5) is
generated and added to $\mathcal{C}$. The process is repeated until the termination criterion is met.
Since the size of the working set $\mathcal{C}$ is often much smaller than that of $\mathcal{B}$, one can use existing
convex optimization techniques to obtain $\alpha$ from Eq. (11).

For the non-convex optimization methods reviewed in Section 2.1, a new label assignment
for the unlabeled data is also generated in each iteration. However, they are very
different from our proposal. First, those algorithms do not take the previous label
assignments into account, while, as will be seen in Section 4.1.2, our WellSVM aims to learn
a combination of previous label assignments. Moreover, they update the label assignment
to approach a locally optimal solution, while our WellSVM aims to obtain a tight convex
relaxation solution.

3.5 Computational Complexity

The key to analyzing the running time of Algorithm 1 is its convergence rate, and we have
the following Theorem.

Theorem 2 Let $p^{(t)}$ be the optimal objective value of Eq. (11) at the $t$-th iteration. Then,

\[ p^{(t+1)} \le p^{(t)} - \eta, \tag{12} \]

where $\eta = \Big(\frac{-c + \sqrt{c^2 + 4\epsilon}}{2}\Big)^2$ and $c = M\sqrt{2/\lambda}$.

Proof is in the appendix. From Theorem 2, we can obtain the following convergence rate.

Proposition 2 Algorithm 1 converges in no more than $\frac{p^{(1)} - p^*}{\eta}$ iterations, where $p^*$ is the
optimal objective value of WellSVM.

According to Assumption 1, we have $p^* = \min_{\hat{y} \in \mathcal{B}} \max_{\alpha \in \mathcal{A}} G(\alpha, \hat{y}) \ge lb$ and $p^{(1)} =
\max_{\alpha \in \mathcal{A}} G(\alpha, \hat{y}) \le ub$. Moreover, recall that $lb$ and $ub$ are polynomial in $N$. Thus,
Proposition 2 shows that with the use of the cutting plane algorithm, the number of active
constraints only scales polynomially in $N$. In particular, as discussed in Section 3.1, for the
$\nu$-SVM, $lb = -\frac{1}{2}(\upsilon + \frac{1}{C})$ and $ub = 0$, both of which are unrelated to $N$. Thus, the number
of active constraints only scales as $O(1)$.

Proposition 2 can be further refined by taking the search effort of a violated label into
account. The proof is similar to that of Theorem 2.

Proposition 3 Let $\epsilon_r \ge \epsilon$, $\forall r = 1, 2, \dots$, be the magnitude of the violation of a violated
label in the $r$-th iteration, i.e., $\epsilon_r = \min_{y \in \mathcal{C}_r} G(\alpha, y) - G(\alpha, \hat{y}^r)$, where $\mathcal{C}_r$ and $\hat{y}^r$ denote
the set of violated labels and the violated label obtained in the $r$-th iteration, respectively.
Let $\eta_r = \Big(\frac{-c + \sqrt{c^2 + 4\epsilon_r}}{2}\Big)^2$. Then, Algorithm 1 converges in no more than $R$ iterations where
$\sum_{r=1}^{R} \eta_r \ge p^{(1)} - p^*$.

Hence, the more effort is spent on finding a violated label, the faster is the convergence.
This represents a trade-off between the convergence rate and cost in each iteration.

We will show in Section 4 that step 4 of Algorithm 1 can be addressed via multiple
kernel learning techniques which only involve a series of SVM subproblems that can be
solved efficiently by state-of-the-art SVM software such as LIBSVM (Fan et al., 2005)
and LIBLINEAR (Hsieh et al., 2008), while step 5 can be efficiently addressed by sorting.
Therefore, the total time complexity of WellSVM scales as the existing SVM solvers, and
is significantly faster than SDP relaxations.

4. Three Weak-Label Learning Problems 

In this section, we present the detailed formulations of WellSVM on three common
weak-label learning tasks, namely, semi-supervised learning (Section 4.1), multi-instance learning
(Section 4.2), and clustering (Section 4.3).

4.1 Semi-Supervised Learning

In semi-supervised learning, not all the training labels are known. Let $\mathcal{D}_{\mathcal{L}} = \{x_i, y_i\}_{i=1}^{l}$
and $\mathcal{D}_{\mathcal{U}} = \{x_j\}_{j=l+1}^{N}$ be the sets of labeled and unlabeled examples, respectively, and
$\mathcal{L} = \{1, \dots, l\}$ (resp. $\mathcal{U} = \{l+1, \dots, N\}$) be the index set of the labeled (resp. unlabeled)
examples. In semi-supervised learning, unlabeled data are typically much more abundant
than labeled data, i.e., $N - l \gg l$. Hence, one can obtain a trivially "optimal" solution with
infinite margin by assigning all the unlabeled examples to the same label. To prevent such
a useless solution, Joachims (1999) introduced the balance constraint

\[ \frac{\mathbf{1}'\hat{y}_{\mathcal{U}}}{N - l} = \frac{\mathbf{1}'y_{\mathcal{L}}}{l}, \]

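where $\hat{y} = [\hat{y}_1, \dots, \hat{y}_N]'$ is the vector of learned labels on both labeled and unlabeled
examples, $y_{\mathcal{L}} = [y_1, \dots, y_l]'$ and $\hat{y}_{\mathcal{U}} = [\hat{y}_{l+1}, \dots, \hat{y}_N]'$. Let $\Omega = \frac{1}{2}\|w\|^2$ and $\ell_f(\mathcal{D})$ be the sum of
hinge loss values on both labeled and unlabeled data. Eq. (2) then leads to

\[ \min_{\hat{y} \in \mathcal{B}} \min_{w, \xi} \; \frac{1}{2}\|w\|^2 + C_1 \sum_{i=1}^{l} \xi_i + C_2 \sum_{i=l+1}^{N} \xi_i \quad \text{s.t.} \;\; \hat{y}_i w'\phi(x_i) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i = 1, \dots, N, \tag{13} \]

where $\mathcal{B} = \{\hat{y} \mid \hat{y} = [\hat{y}_{\mathcal{L}}; \hat{y}_{\mathcal{U}}], \; \hat{y}_{\mathcal{L}} = y_{\mathcal{L}}, \; \hat{y}_{\mathcal{U}} \in \{\pm 1\}^{N-l}; \; \frac{\mathbf{1}'\hat{y}_{\mathcal{U}}}{N-l} = \frac{\mathbf{1}'y_{\mathcal{L}}}{l}\}$, and $C_1, C_2$ trade off
model complexity and empirical losses on the labeled and unlabeled data, respectively. The
inner minimization problem can be rewritten in its dual, as:

\[ \min_{\hat{y} \in \mathcal{B}} \max_{\alpha \in \mathcal{A}} \; G(\alpha, \hat{y}) := \mathbf{1}'\alpha - \frac{1}{2}\alpha'\big(K \odot \hat{y}\hat{y}'\big)\alpha, \tag{14} \]

where $\alpha = [\alpha_1, \dots, \alpha_N]'$ is the vector of dual variables, and $\mathcal{A} = \{\alpha \mid C_1 \ge \alpha_i \ge 0, \; C_2 \ge
\alpha_j \ge 0, \; i \in \mathcal{L}, \; j \in \mathcal{U}\}$.

Using Proposition 1, we have

\[ \min_{\mu \in \mathcal{M}} \max_{\alpha \in \mathcal{A}} \; \mathbf{1}'\alpha - \frac{1}{2}\alpha'\Big(\sum_{t : \hat{y}_t \in \mathcal{B}} \mu_t K \odot \hat{y}_t\hat{y}_t'\Big)\alpha, \tag{15} \]

which is a convex relaxation of Eq. (14). Note that $G(\alpha, \hat{y})$ can be rewritten as $\bar{G}(\alpha, M_{\hat{y}}) =
\mathbf{1}'\alpha - \frac{1}{2}\alpha'(K \odot M_{\hat{y}})\alpha$, where $\bar{G}$ is concave in $\alpha$ and linear in $M_{\hat{y}}$. Hence, according to
Theorem 1, WellSVM is at least as tight as the SDP relaxations in (Xu and Schuurmans,
2005; De Bie and Cristianini, 2006).

Notice the similarity with standard SVM, which involves a single kernel matrix $K \odot \hat{y}\hat{y}'$.
Hence, Eq. (15) can be regarded as multiple kernel learning (MKL) (Lanckriet et al., 2004),
where the target kernel matrix is a convex combination of $|\mathcal{B}|$ base kernel matrices
$\{K \odot \hat{y}_t\hat{y}_t'\}_{t : \hat{y}_t \in \mathcal{B}}$, each of which is constructed from a feasible label vector $\hat{y}_t \in \mathcal{B}$.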

4.1.1 Algorithm 

From Section 3, the cutting plane algorithm is used to solve Eq. (15). There are two important
issues that have to be addressed in the use of cutting plane algorithms. First, how to
efficiently solve the MKL optimization problem? Second, how to efficiently find a violated
$\hat{y}$? These will be addressed in Sections 4.1.2 and 4.1.3, respectively.

4.1.2 Multiple Label-Kernel Learning

In recent years, much effort has been devoted to efficient MKL approaches. Lanckriet et al.
(2004) first proposed the use of quadratically constrained quadratic programming (QCQP)
in MKL. Bach et al. (2004) showed that an approximate solution can be efficiently obtained
by using sequential minimal optimization (SMO) (Platt, 1999). Recently,
Sonnenburg et al. (2006) proposed a semi-infinite linear programming (SILP) formulation
which allows MKL to be iteratively solved with a standard SVM solver and linear programming.
Rakotomamonjy et al. (2008) proposed a weighted 2-norm regularization with additional
constraints on the kernel weights to encourage a sparse kernel combination. Xu et al.
(2009) proposed the use of the extended level method to improve its convergence, which is
further refined by the MKLGL algorithm (Xu et al., 2010). Extension to nonlinear MKL
combinations is also studied recently in (Kloft et al., ...).



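Unlike standard MKL problems which try to find the optimal kernel function/matrix for
a given set of labels, here, we have to find the optimal label kernel matrix. In this paper, we
use an adaptation of the MKLGL algorithm (Xu et al., 2010) to solve this multiple label-kernel
learning (MLKL) problem. More specifically, suppose that the current working set
is $\mathcal{C} = \{\hat{y}_1, \dots, \hat{y}_T\}$. Note that the feature map corresponding to the base kernel matrix
$K \odot \hat{y}_t\hat{y}_t'$ is $x_i \mapsto \hat{y}_{ti}\phi(x_i)$. The MKL problem in Eq. (15) thus corresponds to the following
primal optimization problem:

\[ \min_{\mu \in \mathcal{M}, \, W = [w_1, \dots, w_T], \, \xi} \; \frac{1}{2}\sum_{t=1}^{T} \frac{1}{\mu_t}\|w_t\|^2 + C_1 \sum_{i=1}^{l} \xi_i + C_2 \sum_{i=l+1}^{N} \xi_i \quad \text{s.t.} \;\; \sum_{t=1}^{T} \hat{y}_{ti} w_t'\phi(x_i) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i = 1, \dots, N. \tag{16} \]

It is easy to verify that its dual can be written as

\[ \min_{\mu \in \mathcal{M}} \max_{\alpha \in \mathcal{A}} \; \mathbf{1}'\alpha - \frac{1}{2}\alpha'\Big(\sum_{t=1}^{T} \mu_t K \odot \hat{y}_t\hat{y}_t'\Big)\alpha, \]

which is the same as Eq. (15). Following MKLGL, we can solve Eq. (15) (or, equivalently,
Eq. (16)) by iterating the following two steps until convergence.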

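1. Fix the mixing coefficients $\mu$ of the base kernel matrices and solve Eq. (16). By setting
$\tilde{w} = \big[\frac{w_1}{\sqrt{\mu_1}}, \dots, \frac{w_T}{\sqrt{\mu_T}}\big]'$, $\tilde{x}_i = \big[\sqrt{\mu_1}\,\phi(x_i), \sqrt{\mu_2}\,\hat{y}_{1i}\hat{y}_{2i}\,\phi(x_i), \dots, \sqrt{\mu_T}\,\hat{y}_{1i}\hat{y}_{Ti}\,\phi(x_i)\big]'$ and
$\tilde{y} = \hat{y}_1$, Eq. (16) can be rewritten as

\[ \min_{\tilde{w}, \xi} \; \frac{1}{2}\|\tilde{w}\|^2 + C_1 \sum_{i=1}^{l} \xi_i + C_2 \sum_{i=l+1}^{N} \xi_i \quad \text{s.t.} \;\; \tilde{y}_i \tilde{w}'\tilde{x}_i \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i = 1, \dots, N, \tag{17} \]

which is similar to the primal of the standard SVM and can be efficiently handled by
state-of-the-art SVM solvers.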






Convex and Scalable Weakly Labeled SVMs 



2. Fix $w_t$'s and update $\mu$ in closed-form, as

\[ \mu_t = \frac{\|w_t\|}{\sum_{t'=1}^{T} \|w_{t'}\|}, \quad t = 1, \dots, T. \]

In our experiments, this always converges in fewer than 100 iterations. With the use of 
warm-start, even faster convergence can be expected. 
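The second step can be sketched in a few lines. This is a hedged illustration rather than the authors' code: it assumes the closed-form update is the MKLGL group-lasso rule $\mu_t = \|w_t\| / \sum_{t'} \|w_{t'}\|$, and the helper names `update_mu` and `weighted_regularizer` as well as the toy weight vectors are hypothetical. The check compares the regularizer $\sum_t \|w_t\|^2 / \mu_t$ at the updated $\mu$ against random points on the simplex.

```python
import numpy as np

def update_mu(W):
    """Assumed MKLGL-style update: mu_t proportional to ||w_t||, normalised to sum to one."""
    norms = np.array([np.linalg.norm(w) for w in W])
    return norms / norms.sum()

def weighted_regularizer(W, mu):
    """sum_t ||w_t||^2 / mu_t, the mu-dependent part of the primal objective."""
    return sum(np.dot(w, w) / m for w, m in zip(W, mu))

# Toy weight vectors with norms 5, 3 and 2 (illustrative values only).
W = [np.array([3.0, 4.0]), np.array([0.0, 3.0]), np.array([2.0, 0.0])]
mu = update_mu(W)
best = weighted_regularizer(W, mu)

# No other point on the simplex should give a smaller regularizer.
rng = np.random.default_rng(3)
others = [weighted_regularizer(W, m) for m in rng.dirichlet(np.ones(3), size=500)]
```

Under this assumption the minimum value is $(\sum_t \|w_t\|)^2$, which follows from the Cauchy-Schwarz inequality, so label vectors whose $w_t$ vanishes receive zero weight and drop out of the combination.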

4.1.3 Finding a Violated Label Assignment 

The following optimization problem corresponds to finding the most violated $\hat{y}$:

\[ \min_{\hat{y} \in \mathcal{B}} \; G(\alpha, \hat{y}) = \mathbf{1}'\alpha - \frac{1}{2}\alpha'\big(K \odot \hat{y}\hat{y}'\big)\alpha. \tag{18} \]

The first term in the objective does not relate to $\hat{y}$, so Eq. (18) is rewritten as

\[ \max_{\hat{y} \in \mathcal{B}} \; \frac{1}{2}\alpha'\big(K \odot \hat{y}\hat{y}'\big)\alpha. \tag{19} \]

However, this is a concave QP and cannot be solved efficiently. Note that while the use
of the most violated constraint may lead to faster convergence, the cutting plane algorithm
only requires the addition of a violated constraint at each iteration (Kelley, 1960;
Tsochantaridis et al., 2006). Hence, we propose in the following a simple and efficient
method for finding a violated label assignment.

Consider the following equivalent problem:

\[ \max_{\hat{y} \in \mathcal{B}} \; \hat{y}'H\hat{y}, \tag{20} \]

where $H = K \odot (\alpha\alpha')$ is a psd matrix. Let $\bar{y} \in \mathcal{C}$ be the following suboptimal solution of
Eq. (20): $\bar{y} = \arg\max_{\hat{y} \in \mathcal{C}} \hat{y}'H\hat{y}$.
Consider an optimal solution of the following optimization problem

\[ y^* = \arg\max_{\hat{y} \in \mathcal{B}} \; \hat{y}'H\bar{y}. \tag{21} \]

We have the following proposition. 

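Proposition 4 $y^*$ is a violated label assignment if $\bar{y}'Hy^* \ne \bar{y}'H\bar{y}$.

Proof From $\bar{y}'Hy^* \ne \bar{y}'H\bar{y}$, we have $y^* \ne \bar{y}$. Suppose that $(y^*)'Hy^* \le \bar{y}'H\bar{y}$, then
$(y^*)'Hy^* + \bar{y}'H\bar{y} - 2(y^*)'H\bar{y} \le 2\bar{y}'H\bar{y} - 2(y^*)'H\bar{y} < 0$, which contradicts with $(y^*)'Hy^* +
\bar{y}'H\bar{y} - 2(y^*)'H\bar{y} = (y^* - \bar{y})'H(y^* - \bar{y}) \ge 0$. So, $(y^*)'Hy^* > \bar{y}'H\bar{y}$, which indicates $y^*$ is a
violated label assignment. ■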



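As for solving Eq. (21), it is an integer linear program for $\hat{y}$. We can rewrite this as

$$\max_{\hat{y}} \;\; r'\hat{y} = r_{\mathcal{L}}'\hat{y}_{\mathcal{L}} + r_{\mathcal{U}}'\hat{y}_{\mathcal{U}} \qquad (22)$$
$$\text{s.t.} \quad \hat{y}_{\mathcal{L}} = y_{\mathcal{L}}, \;\; \hat{y}_{\mathcal{U}} \in \{\pm 1\}^{N-l}, \;\; \frac{\mathbf{1}'\hat{y}_{\mathcal{U}}}{N-l} = \frac{\mathbf{1}'y_{\mathcal{L}}}{l},$$

where $r = H\bar{y}$. Since $\hat{y}_{\mathcal{L}}$ is constant, we have the following proposition.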
Proposition 5 At optimality, $\hat{y}_i \ge \hat{y}_j$ if $r_i > r_j$, $i, j \in \mathcal{U}$.

Proof Assume, to the contrary, that the optimal $\hat{y}$ does not have the same sorted order
as $r$. Then, there are two labels $\hat{y}_i$ and $\hat{y}_j$, with $r_i > r_j$ but $\hat{y}_i < \hat{y}_j$. Then
$r_i\hat{y}_i + r_j\hat{y}_j < r_i\hat{y}_j + r_j\hat{y}_i$, as $(r_i - r_j)(\hat{y}_i - \hat{y}_j) < 0$. Thus, $\hat{y}$ is not optimal, a contradiction. ■

Thus, with Proposition 5, we can solve Eq. (22) by first sorting $r_{\mathcal{U}}$ in ascending order. The label
assignment of the $\hat{y}_i$'s aligns with the sorted values of the $r_i$'s for $i \in \mathcal{U}$. To satisfy the balance constraint
$\frac{\mathbf{1}'\hat{y}_{\mathcal{U}}}{N-l} = \frac{\mathbf{1}'y_{\mathcal{L}}}{l}$, the first $\lceil \frac{1}{2}(N-l)(1 - \frac{1}{l}\mathbf{1}'y_{\mathcal{L}}) \rceil$ of the $\hat{y}_{\mathcal{U}}$'s are assigned $-1$, while the last
$(N-l) - \lceil \frac{1}{2}(N-l)(1 - \frac{1}{l}\mathbf{1}'y_{\mathcal{L}}) \rceil$ of them are assigned $1$. Therefore, the label assignment
in problem Eq. (22) can be determined exactly and efficiently by sorting.
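
The number of unlabeled examples assigned $-1$ follows directly from the balance constraint. The short derivation below is a sketch; the symbol $n_-$ and the numbers in the worked example are introduced here only for illustration.

```latex
% n_- : number of unlabeled examples assigned -1 (illustrative symbol)
\mathbf{1}'\hat{y}_{\mathcal{U}} = (N - l - n_-) - n_- = (N - l) - 2 n_-
\quad\Longrightarrow\quad
\frac{(N - l) - 2 n_-}{N - l} = \frac{\mathbf{1}'y_{\mathcal{L}}}{l}
\quad\Longrightarrow\quad
n_- = \frac{1}{2}(N - l)\Big(1 - \frac{1}{l}\mathbf{1}'y_{\mathcal{L}}\Big).

% Worked example (hypothetical numbers): l = 10 labeled examples, 7 positive and 3 negative,
% so 1'y_L = 4, and N - l = 100 unlabeled examples:
n_- = \frac{1}{2}\cdot 100 \cdot \Big(1 - \frac{4}{10}\Big) = 30,
\qquad
\frac{\mathbf{1}'\hat{y}_{\mathcal{U}}}{N - l} = \frac{70 - 30}{100} = 0.4 = \frac{4}{10}.
```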

To find a violated label, we first get the $\bar{y} \in \mathcal{C}$, which takes $O(N^2)$ (resp. $O(N)$) time
when a nonlinear (resp. linear³) kernel is used; next we obtain the $y^*$ in Eq. (21), which takes
$O(N \log N)$ time; and finally check if $y^*$ is a violated label assignment using Proposition 4,
which takes $O(N^2)$ (resp. $O(N)$) time for a nonlinear (resp. linear) kernel. In total, this
takes $O(N^2)$ (resp. $O(N \log N)$) time for a nonlinear (resp. linear) kernel. Therefore, our
proposal is computationally efficient.
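
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the kernel matrix is a dense list of rows, the labeled examples come first, and the names `quad_form`, `find_violated_label` and `working_set` are hypothetical. Since $y^*$ maximizes $\hat{y}'H\bar{y}$, the test $\bar{y}'Hy^* \ne \bar{y}'H\bar{y}$ of Proposition 4 is implemented as a strict inequality with a small tolerance.

```python
def quad_form(H, u, v):
    """Return u' H v for a dense matrix H stored as a list of rows."""
    n = len(u)
    return sum(u[i] * H[i][j] * v[j] for i in range(n) for j in range(n))


def find_violated_label(K, alpha, working_set, y_labeled):
    """Return (y_star, violated); the first len(y_labeled) points are labeled."""
    n, l = len(alpha), len(y_labeled)
    # H = K (elementwise product) alpha alpha'
    H = [[K[i][j] * alpha[i] * alpha[j] for j in range(n)] for i in range(n)]
    # y_bar: the best label vector already in the working set C
    y_bar = max(working_set, key=lambda y: quad_form(H, y, y))
    # r = H y_bar is the linear objective of Eq. (22)
    r = [sum(H[i][j] * y_bar[j] for j in range(n)) for i in range(n)]
    # ceil(0.5 * (N - l) * (1 - 1'y_L / l)) in exact integer arithmetic
    a = (n - l) * (l - sum(y_labeled))
    n_neg = (a + 2 * l - 1) // (2 * l)
    # Sort unlabeled indices by r (ascending); the smallest n_neg get label -1
    order = sorted(range(l, n), key=lambda i: r[i])
    y_star = list(y_labeled) + [1] * (n - l)
    for i in order[:n_neg]:
        y_star[i] = -1
    violated = quad_form(H, y_bar, y_star) > quad_form(H, y_bar, y_bar) + 1e-12
    return y_star, violated


# Toy data: points 0, 2, 3 are identical, and so are points 1, 4, 5.
groups = [0, 1, 0, 0, 1, 1]
K = [[1.0 if gi == gj else 0.0 for gj in groups] for gi in groups]
alpha = [1.0] * 6
y_bar0 = [1, -1, -1, 1, 1, -1]          # a poor but balanced initial guess
y_star, violated = find_violated_label(K, alpha, [y_bar0], [1, -1])
```

On this toy problem the sorted assignment groups each unlabeled point with its identical labeled neighbour, and the check reports a violated assignment.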

Finally, after finishing the training process, we use $f(x) = \sum_{t=1}^{T} w_t'\phi(x)$ as the prediction
function. Algorithm 2 summarizes the pseudocode of WellSVM for semi-supervised
learning.

Algorithm 2 WellSVM for semi-supervised learning.
1: Initialize $\hat{y}$ and the working set $\mathcal{C}$, and obtain the optimal $\{\mu, W\}$ or $\alpha$ from Eq. (16).
2: repeat
3:   Update $\mathcal{C} \leftarrow \{y^*\} \cup \mathcal{C}$.
4:   Obtain the optimal $\{\mu, W\}$ or $\alpha$ from Eq. (16).
5:   Find the optimal solution $y^*$ of Eq. (21).
6: until $G(\alpha, y^*) > \min_{y \in \mathcal{C}} G(\alpha, y) - \epsilon$ or the decrease of objective value is smaller than
   a threshold.
7: Output $f(x) = \sum_{t=1}^{T} w_t'\phi(x)$ as the prediction function.



4.2 Multi-Instance Learning

In this section, we consider the second weakly labeled learning problem, namely, multi-instance
learning (MIL), where examples are bags containing multiple instances. More
formally, we have a data set $\mathcal{D} = \{B_i, y_i\}_{i=1}^{m}$, where $B_i = \{x_{i,1}, \ldots, x_{i,m_i}\}$ is the input
bag, $y_i \in \{\pm 1\}$ is the output and $m$ is the number of bags. Without loss of generality,
we assume that the positive bags are ordered before the negative bags, i.e., $y_i = 1$ for all
$1 \le i \le p$ and $-1$ otherwise. Here, $p$ and $m - p$ are the numbers of positive and negative
bags, respectively. In traditional MIL, a bag is labeled positive if it contains at least one key
(or positive) instance, and negative otherwise. Thus, we only have the bag labels available,
while the instance labels are only implicitly known.

3. When the linear kernel is used, Eq. (20) can be rewritten as $\max_{\hat{y} \in \mathcal{C}} (\alpha \odot \hat{y})'X'X(\alpha \odot \hat{y})$, where
$X = [x_1, \ldots, x_N]$. Hence, one can first compute $o = X(\alpha \odot \hat{y})$ and then compute $o'o$. This takes a total of
$O(N)$ time. A similar trick can be used in checking if $y^*$ is a violated label assignment.

Identification of the key instances from positive bags can be very useful in CBIR. Specifically,
in CBIR, the whole image (bag) can be represented by multiple semantic regions
(instances). Explicit identification of the regions of interest (ROIs) can help the user
quickly recognize the images he/she wants, especially when the system returns a large number
of images. Consequently, the problem of determining whether a region is an ROI can be posed
as finding the key instances in MIL.

Traditional MIL implies that the label of a bag is determined by its most representative
key instance, i.e., $f(B_i) = \max\{f(x_{i,1}), \ldots, f(x_{i,m_i})\}$. Let $\Omega = \frac{1}{2}\|w\|_2^2$ and $\ell_f(\mathcal{D})$ be the
sum of hinge losses on the bags. Eq. (2) then leads to the MI-SVM proposed in (Andrews et al.,
2003):

$$\min_{w,\,\xi} \;\; \frac{1}{2}\|w\|_2^2 + C_1 \sum_{i=1}^{p} \xi_i + C_2 \sum_{i=p+1}^{m} \xi_i \qquad (23)$$
$$\text{s.t.} \quad y_i \max_{1 \le j \le m_i} w'\phi(x_{i,j}) \ge 1 - \xi_i, \quad i = 1, \ldots, m.$$

Here, $C_1$ and $C_2$ trade off the model complexity and empirical losses on the positive and
negative bags, respectively.
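
As an illustration of the bag-level decision rule and the bag hinge loss in Eq. (23), the sketch below works on precomputed instance scores $f(x_{i,j})$. The function names and the score values are hypothetical and serve only as an example.

```python
def bag_decision_value(instance_scores):
    """f(B_i) = max_j f(x_{i,j}): a bag inherits its highest instance score."""
    return max(instance_scores)


def key_instance(instance_scores):
    """Index of the instance that attains the bag-level maximum."""
    return max(range(len(instance_scores)), key=lambda j: instance_scores[j])


def bag_hinge_loss(y_bag, instance_scores):
    """Hinge loss of one bag with label y_bag in {+1, -1}, as in Eq. (23)."""
    return max(0.0, 1.0 - y_bag * bag_decision_value(instance_scores))


positive_bag = [-0.4, 1.3, 0.2]   # hypothetical instance scores
negative_bag = [-1.5, -0.2]

loss_pos = bag_hinge_loss(+1, positive_bag)
loss_neg = bag_hinge_loss(-1, negative_bag)
key = key_instance(positive_bag)
```

The positive bag incurs no loss because one instance scores above the margin, while the negative bag is penalized through its least negative instance.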

For a positive bag $B_i$, we use the binary vector $d_i = [d_{i,1}, \ldots, d_{i,m_i}]' \in \{0,1\}^{m_i}$ to
indicate which instance in $B_i$ is its key instance. Following the traditional MIL setup, we
assume that each positive bag has only one key instance⁴, and so $\sum_{j=1}^{m_i} d_{i,j} = 1$. In the following,
let $d = [d_1, \ldots, d_p]$, and $\Delta$ be its domain. Moreover, note that $\max_{1 \le j \le m_i} w'\phi(x_{i,j})$
in Eq. (23) can be written as $\max_{d_i} \sum_{j=1}^{m_i} d_{i,j}\, w'\phi(x_{i,j})$.

For a negative bag $B_i$, all its instances are negative and the corresponding constraint in
Eq. (23) can be replaced by $-w'\phi(x_{i,j}) \ge 1 - \xi_i$ for every instance $x_{i,j}$ in $B_i$. Moreover, we
relax the problem by allowing the slack variables $\xi_i$'s to be different for different instances
in $B_i$. This leads to a set of slack variables $\{\xi_{s(i,j)}\}_{i=p+1,\ldots,m;\; j=1,\ldots,m_i}$, where the indexing
function $s(i,j) = J_{i-1} - J_p + j + p$ ranges from $p+1$ to $q = N - J_p + p$ and $J_i = \sum_{t=1}^{i} m_t$
($J_0$ is set to 0). Combining all these together, Eq. (23) can be rewritten as:

$$\min_{d \in \Delta} \min_{w,\,\xi} \;\; \frac{1}{2}\|w\|_2^2 + C_1 \sum_{i=1}^{p} \xi_i + C_2 \sum_{i=p+1}^{m} \sum_{j=1}^{m_i} \xi_{s(i,j)} \qquad (24)$$
$$\text{s.t.} \quad \sum_{j=1}^{m_i} w' d_{i,j}\, \phi(x_{i,j}) \ge 1 - \xi_i, \quad i = 1, \ldots, p,$$
$$\qquad -w'\phi(x_{i,j}) \ge 1 - \xi_{s(i,j)}, \quad i = p+1, \ldots, m, \;\; j = 1, \ldots, m_i.$$
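
The indexing function $s(i,j)$ simply numbers the instances of the negative bags consecutively, starting right after the $p$ slack variables of the positive bags. A small sketch with hypothetical bag sizes (the helper name `make_slack_index` is an assumption) makes this concrete:

```python
from itertools import accumulate


def make_slack_index(bag_sizes, p):
    """Return (s, q) for bags of the given sizes, the first p of which are positive."""
    # J[i] = m_1 + ... + m_i with J[0] = 0
    J = [0] + list(accumulate(bag_sizes))

    def s(i, j):
        # Bags and instances are 1-indexed, as in the text.
        return J[i - 1] - J[p] + j + p

    q = J[-1] - J[p] + p   # N = J[-1] is the total number of instances
    return s, q


bag_sizes = [2, 3, 2, 4]   # m_1..m_4; bags 1 and 2 are positive (p = 2)
s, q = make_slack_index(bag_sizes, p=2)
indices = [s(i, j) for i in (3, 4) for j in range(1, bag_sizes[i - 1] + 1)]
```

Here the six negative instances receive the indices $p+1 = 3$ through $q = 8$.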

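The inner minimization problem is usually written in its dual, as:

$$\max_{\alpha \in \mathcal{A}} \; G(\alpha, d) = \mathbf{1}'\alpha - \frac{1}{2}(\alpha \odot \hat{y})'\big(K^{d}\big)(\alpha \odot \hat{y}), \qquad (25)$$

where $\alpha = [\alpha_1, \ldots, \alpha_q]' \in \mathbb{R}^q$ is the vector of dual variables, $\mathcal{A} = \{\alpha \mid C_1 \ge \alpha_i \ge 0,\; C_2 \ge \alpha_j \ge 0,\; i = 1, \ldots, p;\; j = p+1, \ldots, q\}$,
$\hat{y} = [\mathbf{1}_p, -\mathbf{1}_{q-p}] \in \mathbb{R}^q$, and $K^{d} \in \mathbb{R}^{q \times q}$ is the kernel matrix
where $K^{d}_{ij} = (\psi_i^{d})'(\psi_j^{d})$ with

$$\psi_i^{d} = \begin{cases} \sum_{j=1}^{m_i} d_{i,j}\,\phi(x_{i,j}) & i = 1, \ldots, p, \\ \phi(x_{s(i,j)}) & i = p+1, \ldots, m;\; j = 1, \ldots, m_i. \end{cases} \qquad (26)$$

Thus, Eq. (25) is a mixed-integer programming problem. With Proposition 1, we have

$$\max_{\alpha \in \mathcal{A}} \min_{\mu \in \mathcal{M}} \;\; \mathbf{1}'\alpha - \frac{1}{2}(\alpha \odot \hat{y})'\Big(\sum_{t:\, d_t \in \Delta} \mu_t K^{d_t}\Big)(\alpha \odot \hat{y}), \qquad (27)$$

which is a convex relaxation of Eq. (25).

4. Sometimes, one can allow for more than one key instance in a positive bag (Wang et al., 2008;
Xu and Frank, 2004; Zhou and Zhang, 2007; Zhou et al., 2012). The proposed method can be extended
to this case by setting $\sum_{j=1}^{m_i} d_{i,j} = v$, where $v$ is the known number of key instances.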
4.2.1 Algorithm 

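Similar to semi-supervised learning, the cutting plane algorithm is used for solving Eq. (27).
Recall that there are two issues in the use of cutting-plane algorithms, namely, efficient
multiple label-kernel learning and the finding of a violated label assignment. For the first
issue, suppose that the current $\mathcal{C}$ is $\{d_1, \ldots, d_T\}$; the MKL problem in Eq. (27) corresponds
to the following primal problem:

$$\min_{\mu \in \mathcal{M},\, W = [w_1; \ldots; w_T],\, \xi} \;\; \frac{1}{2}\sum_{t=1}^{T} \frac{1}{\mu_t}\|w_t\|^2 + C_1 \sum_{i=1}^{p} \xi_i + C_2 \sum_{i=p+1}^{m} \sum_{j=1}^{m_i} \xi_{s(i,j)} \qquad (28)$$
$$\text{s.t.} \quad \sum_{t=1}^{T} \Big( \sum_{j=1}^{m_i} w_t'\, d^{t}_{i,j}\, \phi(x_{i,j}) \Big) \ge 1 - \xi_i, \quad i = 1, \ldots, p,$$
$$\qquad -\sum_{t=1}^{T} w_t'\phi(x_{s(i,j)}) \ge 1 - \xi_{s(i,j)}, \quad i = p+1, \ldots, m;\; j = 1, \ldots, m_i.$$

Therefore, we can still apply the MKLGL algorithm to solve the MKL problem in Eq. (27)
efficiently. As for the second issue, one needs to solve the following problem:

$$\min_{d \in \Delta} \;\; \mathbf{1}'\alpha - \frac{1}{2}(\alpha \odot \hat{y})' K^{d} (\alpha \odot \hat{y}),$$

which is equivalent to

$$\max_{d \in \Delta} \;\; \sum_{i,j=1}^{q} \alpha_i \alpha_j \hat{y}_i \hat{y}_j (\psi_i^{d})'(\psi_j^{d}).$$

According to the definition of $\psi$ in Eq. (26), this can be rewritten as

$$\max_{d \in \Delta} \;\; \Big\| \sum_{i=1}^{p} \alpha_i \sum_{j=1}^{m_i} d_{i,j}\, \phi(x_{i,j}) - \sum_{i=p+1}^{m} \sum_{j=1}^{m_i} \alpha_{s(i,j)}\, \phi(x_{i,j}) \Big\|^2,$$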

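which can be reformulated as

$$\max_{d \in \Delta} \;\; d'Hd + \tau'd, \qquad (29)$$

where $H \in \mathbb{R}^{J_p \times J_p}$ and $\tau \in \mathbb{R}^{J_p}$. Let $v(i,j) = J_{i-1} + j$, $i \in 1, \ldots, p$, $j \in 1, \ldots, m_i$. Expanding the squared norm and dropping the term that does not depend on $d$, we have

$$H_{v(i,j),\,v(i',j')} = \alpha_i \alpha_{i'}\, \phi(x_{i,j})'\phi(x_{i',j'}), \qquad \tau_{v(i,j)} = -2\alpha_i\, \phi(x_{i,j})' \Big( \sum_{i'=p+1}^{m} \sum_{j'=1}^{m_{i'}} \alpha_{s(i',j')}\, \phi(x_{i',j'}) \Big).$$

It is easy to verify that $H$ is psd.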

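Eq. (29) is also a concave QP whose globally optimal solution, or equivalently the most
violated $d$, is intractable in general. In the following, we adapt a variant of the simple
yet efficient method proposed in Section 4.1.3 to find a violated $d$. Let $\bar{d} \in \mathcal{C}$, where $\mathcal{C} =
\{d_1, \ldots, d_T\}$, be the following suboptimal solution of Eq. (29): $\bar{d} = \arg\max_{d \in \mathcal{C}} d'Hd + \tau'd$.
Let $d^*$ be an optimal solution of the following optimization problem:

$$d^* = \arg\max_{d \in \Delta} \;\; d'H\bar{d} + \frac{\tau'd}{2}. \qquad (30)$$

Proposition 6 $d^*$ is a violated label assignment when $(d^*)'H\bar{d} + \frac{\tau'd^*}{2} > \bar{d}'H\bar{d} + \frac{\tau'\bar{d}}{2}$.

Proof From $(d^*)'H\bar{d} + \frac{\tau'd^*}{2} > \bar{d}'H\bar{d} + \frac{\tau'\bar{d}}{2}$, we have $d^* \ne \bar{d}$. Suppose that $(d^*)'Hd^* +
\tau'd^* \le \bar{d}'H\bar{d} + \tau'\bar{d}$. Then

$$\big((d^*)'Hd^* + \tau'd^*\big) + \big(\bar{d}'H\bar{d} + \tau'\bar{d}\big) - \big[2(d^*)'H\bar{d} + \tau'\bar{d} + \tau'd^*\big] \le 2\Big[\bar{d}'H\bar{d} + \frac{\tau'\bar{d}}{2} - (d^*)'H\bar{d} - \frac{\tau'd^*}{2}\Big] < 0,$$

which contradicts

$$\big((d^*)'Hd^* + \tau'd^*\big) + \big(\bar{d}'H\bar{d} + \tau'\bar{d}\big) - \big[2(d^*)'H\bar{d} + \tau'\bar{d} + \tau'd^*\big] = (d^* - \bar{d})'H(d^* - \bar{d}) \ge 0.$$

So, $(d^*)'Hd^* + \tau'd^* > \bar{d}'H\bar{d} + \tau'\bar{d}$, which indicates that $d^*$ is a violated label assignment. ■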



Similar to Eq. (21), Eq. (30) is also a linear integer program but with different constraints.
We now show that the optimal $d^*$ in Eq. (30) can still be obtained efficiently. Notice that
Eq. (30) can be reformulated as

$$\max_{d} \;\; r'd \qquad (31)$$
$$\text{s.t.} \quad \mathbf{1}'d_i = 1, \;\; d_i \in \{0,1\}^{m_i}, \;\; i = 1, \ldots, p,$$

where $r = H\bar{d} + \frac{\tau}{2}$. As can be seen, the $d_i$'s are decoupled in both the objective and constraints
of Eq. (31). Therefore, one can obtain its optimal solution by solving the $p$ subproblems
individually:

$$\max_{d_i} \;\; \sum_{j=1}^{m_i} r_{J_{i-1}+j}\, d_{i,j} \qquad \text{s.t.} \quad \mathbf{1}'d_i = 1, \;\; d_i \in \{0,1\}^{m_i}.$$

It is evident that the optimal $d_i$ can be obtained by assigning $d_{i,\hat{j}} = 1$, where $\hat{j}$ is the index
of the largest element among $[r_{J_{i-1}+1}, \ldots, r_{J_{i-1}+m_i}]$, and the rest to zero. Similar to semi-supervised
learning, the complexity to find a violated $d$ scales as $O(N^2)$ (resp. $O(N \log N)$)
when the nonlinear (resp. linear) kernel is used, and so is computationally efficient.
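
Because the subproblems decouple over the positive bags, the optimal $d^*$ is a per-bag argmax. The sketch below assumes the score vector $r$ is already computed and stored bag by bag, so that the entries of bag $i$ occupy positions $J_{i-1}+1, \ldots, J_{i-1}+m_i$; the function name and the numbers are hypothetical.

```python
def select_key_instances(r, bag_sizes):
    """Return one one-hot indicator vector d_i per positive bag."""
    d_star, offset = [], 0
    for m_i in bag_sizes:
        scores = r[offset:offset + m_i]
        # The key instance of the bag is the one with the largest score.
        best = max(range(m_i), key=lambda j: scores[j])
        d_star.append([1 if j == best else 0 for j in range(m_i)])
        offset += m_i
    return d_star


r = [0.2, 0.9, -0.1, 0.4, 0.3]        # hypothetical scores for two positive bags
d_star = select_key_instances(r, bag_sizes=[3, 2])
```

Each returned vector satisfies the constraint $\mathbf{1}'d_i = 1$ by construction.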
On prediction, each instance $x$ can be treated as a bag, and its output from the
WellSVM is given by $f(x) = \sum_{t=1}^{T} w_t'\phi(x)$. Algorithm 3 summarizes the pseudocode
of WellSVM for multi-instance learning.

Algorithm 3 WellSVM for multi-instance learning.
1: Initialize $d$ and the working set $\mathcal{C}$, and obtain the optimal $\{\mu, W\}$ or $\alpha$ from Eq. (28).
2: repeat
3:   Update $\mathcal{C} \leftarrow \{d^*\} \cup \mathcal{C}$.
4:   Obtain the optimal $\{\mu, W\}$ or $\alpha$ from Eq. (28).
5:   Find the optimal solution $d^*$ of Eq. (30).
6: until $G(\alpha, d^*) > \min_{d \in \mathcal{C}} G(\alpha, d) - \epsilon$ or the decrease of objective value is smaller than
   a threshold.
7: Output $f(x) = \sum_{t=1}^{T} w_t'\phi(x)$ as the prediction function.



4.3 Clustering 

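In this section, we consider the third weakly labeled learning task, namely, clustering,
where all the class labels are unknown. Similar to semi-supervised learning, one can obtain
a trivially "optimal" solution with infinite margin by assigning all patterns to the same
cluster. To prevent such a useless solution, Xu et al. (2005) introduced a class balance
constraint

$$-\beta \le \mathbf{1}'\hat{y} \le \beta,$$

where $\hat{y} = [\hat{y}_1, \ldots, \hat{y}_N]'$ is the vector of unknown labels, and $\beta \ge 0$ is a user-defined constant
controlling the class imbalance.

Let $\Omega(f) = \frac{1}{2}\|w\|_2^2$ and $\ell_f(\mathcal{D})$ be the sum of hinge losses on the individual examples.
Eq. (2) then leads to

$$\min_{\hat{y} \in \mathcal{B}} \min_{w,\,\xi} \;\; \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{N} \xi_i \qquad (32)$$
$$\text{s.t.} \quad \hat{y}_i\, w'\phi(x_i) \ge 1 - \xi_i, \quad i = 1, \ldots, N,$$

where $\mathcal{B} = \{\hat{y} \mid \hat{y}_i \in \{+1, -1\},\; i = 1, \ldots, N;\; -\beta \le \mathbf{1}'\hat{y} \le \beta\}$. The inner minimization
problem is usually rewritten in its dual

$$\min_{\hat{y} \in \mathcal{B}} \max_{\alpha} \;\; \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{N} \alpha_i \alpha_j \hat{y}_i \hat{y}_j\, \phi(x_i)'\phi(x_j) \qquad (33)$$
$$\text{s.t.} \quad C \ge \alpha_i \ge 0, \quad i = 1, \ldots, N,$$

where $\alpha_i$ is the dual variable for each inequality constraint in Eq. (32). Let $\alpha = [\alpha_1, \ldots, \alpha_N]'$
be the vector of dual variables, and $\mathcal{A} = \{\alpha \mid C\mathbf{1} \ge \alpha \ge \mathbf{0}\}$. Then Eq. (33) can be rewritten
in matrix form as:

$$\min_{\hat{y} \in \mathcal{B}} \max_{\alpha \in \mathcal{A}} \;\; G(\alpha, \hat{y}) := \mathbf{1}'\alpha - \frac{1}{2}\alpha'\big(K \odot \hat{y}\hat{y}'\big)\alpha. \qquad (34)$$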
This, however, is still a mixed integer programming problem.
With Proposition 1, we have

$$\max_{\alpha \in \mathcal{A}} \min_{\mu \in \mathcal{M}} \;\; \mathbf{1}'\alpha - \frac{1}{2}\alpha'\Big(\sum_{t:\, \hat{y}_t \in \mathcal{B}} \mu_t\, K \odot \hat{y}_t\hat{y}_t'\Big)\alpha \qquad (35)$$

as a convex relaxation of Eq. (34). Note that $G(\alpha, \hat{y})$ can be reformulated as $\bar{G}(\alpha, M_y) =
\mathbf{1}'\alpha - \frac{1}{2}\alpha'(K \odot M_y)\alpha$, where $\bar{G}$ is concave in $\alpha$ and linear in $M_y$. Hence, according to
Theorem 1, WellSVM is at least as tight as the SDP relaxation in (Xu et al., 2005).



4.3.1 Algorithm 

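The cutting plane algorithm can still be applied for clustering. Similar to semi-supervised
learning, the MKL can be formulated as the following primal problem:

$$\min_{\mu \in \mathcal{M},\, W = [w_1; \ldots; w_T],\, \xi} \;\; \frac{1}{2}\sum_{t=1}^{T} \frac{1}{\mu_t}\|w_t\|^2 + C \sum_{i=1}^{N} \xi_i \qquad (36)$$
$$\text{s.t.} \quad \sum_{t=1}^{T} \hat{y}_{ti}\, w_t'\phi(x_i) \ge 1 - \xi_i, \quad i = 1, \ldots, N,$$

and its dual is

$$\min_{\mu \in \mathcal{M}} \max_{\alpha \in \mathcal{A}} \;\; \mathbf{1}'\alpha - \frac{1}{2}\alpha'\Big(\sum_{t=1}^{T} \mu_t\, K \odot \hat{y}_t\hat{y}_t'\Big)\alpha,$$

which is the same as Eq. (35). Therefore, the MKLGL algorithm can still be applied for solving
the MKL problem in Eq. (35) efficiently.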

As for finding a violated label assignment, let $\bar{y} \in \mathcal{C}$ be

$$\bar{y} = \arg\max_{\hat{y} \in \mathcal{C}} \; \hat{y}'H\hat{y},$$

where $H = K \odot (\alpha\alpha')$ is a positive semidefinite matrix. Consider an optimal solution of
the following optimization problem:

$$y^* = \arg\max_{\hat{y} \in \mathcal{B}} \; \hat{y}'H\bar{y}. \qquad (37)$$

With Proposition 4, we obtain that $y^*$ is a violated label assignment if $\bar{y}'Hy^* > \bar{y}'H\bar{y}$.
Note that Eq. (37) is an integer linear program for $\hat{y}$ and can be formulated as

$$\max_{\hat{y}} \;\; r'\hat{y} \qquad (38)$$
$$\text{s.t.} \quad -\beta \le \hat{y}'\mathbf{1} \le \beta, \;\; \hat{y} \in \{-1, +1\}^{N},$$

where $r = H\bar{y}$. From Proposition 5, we can solve Eq. (38) by first sorting the $r_i$'s. The label
assignment of the $\hat{y}_i$'s aligns with the sorted values of the $r_i$'s. To satisfy the balance constraint
$-\beta \le \mathbf{1}'\hat{y} \le \beta$, the first $\frac{1}{2}(N - \beta)$ of the $\hat{y}_i$'s are assigned $-1$, and the last $\frac{1}{2}(N - \beta)$ of them are assigned
$1$. The rest are assigned values from $\{-1, 1\}$ such that the objective $r'\hat{y}$ is maximized.
Similar to semi-supervised learning, the complexity to find a violated label scales as $O(N^2)$
(resp. $O(N \log N)$) when the nonlinear (resp. linear) kernel is used, and so is computationally
efficient. Finally, we use $f(x) = \sum_{t=1}^{T} w_t'\phi(x)$ as the prediction function. Algorithm 4
summarizes the pseudocode of WellSVM for clustering.
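
The sorting-based assignment for clustering can be sketched as follows. The sketch assumes that $\beta$ is a non-negative integer and that the balance constraint is feasible for the given $N$; it rounds $\frac{1}{2}(N-\beta)$ up when $N - \beta$ is odd, and gives each unconstrained middle entry the sign of its $r_i$, which maximizes $r'\hat{y}$. The function name is hypothetical.

```python
def balanced_cluster_labels(r, beta):
    """Maximize r'y over y in {-1,+1}^N subject to -beta <= sum(y) <= beta."""
    n = len(r)
    k = max(0, (n - beta + 1) // 2)   # ceil((N - beta) / 2) forced labels per side
    if 2 * k > n:
        raise ValueError("balance constraint is infeasible for this N and beta")
    order = sorted(range(n), key=lambda i: r[i])   # ascending r
    y = [0] * n
    for pos, i in enumerate(order):
        if pos < k:
            y[i] = -1                 # smallest scores are forced to -1
        elif pos >= n - k:
            y[i] = 1                  # largest scores are forced to +1
        else:
            y[i] = 1 if r[i] >= 0 else -1   # free entries follow the sign of r
    return y


r = [0.5, -2.0, 0.1, 1.5, -0.3, 0.2]   # hypothetical scores r = H y_bar
y = balanced_cluster_labels(r, beta=2)
```

With $N = 6$ and $\beta = 2$, two labels are forced on each side and the two middle entries are free; both have positive scores here, so the label sum reaches the upper bound $\beta$.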






Algorithm 4 WellSVM for clustering.
1: Initialize $\hat{y}$ and the working set $\mathcal{C}$, and obtain the optimal $\{\mu, W\}$ or $\alpha$ from Eq. (36).
2: repeat
3:   Update $\mathcal{C} \leftarrow \{y^*\} \cup \mathcal{C}$.
4:   Obtain the optimal $\{\mu, W\}$ or $\alpha$ from Eq. (36).
5:   Find the optimal solution $y^*$ of Eq. (37).
6: until $G(\alpha, y^*) > \min_{y \in \mathcal{C}} G(\alpha, y) - \epsilon$ or the decrease of objective value is smaller than
   a threshold.
7: Output $f(x) = \sum_{t=1}^{T} w_t'\phi(x)$ as the prediction function.



5. Experiments 



In this section, comprehensive evaluations are performed to verify the effectiveness of the 
proposed WellSVM. Experiments are conducted on all the three aforementioned weakly 
labeled learning tasks: semi-supervised learning (Section 5.1), multi-instance learning
(Section 5.2) and clustering (Section 5.3). The WellSVM is implemented using
LIBSVM (Fan et al., 2005) for nonlinear kernels, and LIBLINEAR (Hsieh et al., 2008) for
the linear kernel. Experiments are run on a 3.20GHz Intel Xeon(R)2 Duo PC running
Windows 7 with 8GB main memory. For all the other methods that will be used for comparison,
the default stopping criteria in the corresponding packages are used. For the WellSVM, 
both the e and stopping threshold in Algorithm [1] are set to 10~^. 



5.1 Semi-Supervised Learning 

We first evaluate the WellSVM on semi-supervised learning with a large collection of
real-world data sets. Sixteen UCI data sets, which cover a wide range of properties, and two
large-scale data sets⁵ are used. Table 1 shows some statistics of these data sets.



Table 1: Data sets used in the experiments.

     Data             # Instances   # Features  |       Data          # Instances   # Features
 1   Echocardiogram           132            8  |  10   Clean1                476          166
 2   House                    232           16  |  11   Isolet                600           51
 3   Heart                    270            9  |  12   Australian            690           42
 4   Heart-statlog            270           13  |  13   Diabetes              768            8
 5   Haberman                 306           14  |  14   German              1,000           59
 6   LiverDisorders           345            6  |  15   Krvskp              3,196           36
 7   Spectf                   349           44  |  16   Sick                3,772           31
 8   Ionosphere               351           34  |  17   real-sim           72,309       20,958
 9   House-votes              435           16  |  18   rcv1              677,399       47,236

5. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
5.1.1 Small-Scale Experiments

For each UCI data set, 75% of the examples are randomly chosen for training, and the
rest for testing. We investigate the performance of each approach with varying amounts
of labeled data (namely, 5%, 10% and 15% of all the labeled data). The whole setup is
repeated 30 times and the average accuracies (with standard deviations) on the test set are
reported.

We compare WellSVM with 1) the standard SVM (using labeled data only), and three
state-of-the-art semi-supervised SVMs (S³VMs), namely 2) Transductive SVM (TSVM)⁶
(Joachims, 1999); 3) Laplacian SVM (LapSVM)⁷ (Belkin et al., 2006); and 4) UniverSVM
(USVM)⁸ (Collobert et al., 2006). Note that TSVM and USVM adopt the same objective
as WellSVM, but with different optimization strategies (local search and constrained
convex-concave procedure, respectively). LapSVM is another S³VM based on the manifold
assumption (Belkin et al., 2006). The SDP-based S³VMs (Xu and Schuurmans, 2005;
De Bie and Cristianini, 2006) are not compared, as they do not converge after 3 hours on
even the smallest data set (Echocardiogram).

Parameters of the different methods are set as follows. $C_1$ is fixed at 1 and $C_2$ is selected
in the range {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1}. The linear and Gaussian kernels are used
for all SVMs, where the width $\sigma$ of the Gaussian kernel $k(x, \hat{x}) = \exp(-\|x - \hat{x}\|^2 / 2\sigma^2)$ is
picked from $\{0.25\sqrt{\gamma}, 0.5\sqrt{\gamma}, \sqrt{\gamma}, 2\sqrt{\gamma}, 4\sqrt{\gamma}\}$, with $\gamma$ being the average distance between all
instance pairs. The initial label assignment of WellSVM is obtained from the predictions
of a standard SVM. For LapSVM, the number of nearest neighbors in the underlying data
graph is selected from {3, 5, 7, 9}. All parameters are determined by using the five-fold
cross-validated accuracy.
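
The candidate kernel widths can be computed as in the following sketch. It assumes that "distance" means the Euclidean distance and that the average is taken over all unordered pairs of distinct instances; the function names and the three toy points are hypothetical.

```python
import math
from itertools import combinations


def gaussian_width_grid(X):
    """Return (gamma, candidate sigmas), gamma = average pairwise Euclidean distance."""
    dists = [math.dist(a, b) for a, b in combinations(X, 2)]
    gamma = sum(dists) / len(dists)
    sigmas = [c * math.sqrt(gamma) for c in (0.25, 0.5, 1.0, 2.0, 4.0)]
    return gamma, sigmas


def gaussian_kernel(x, z, sigma):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return math.exp(-math.dist(x, z) ** 2 / (2.0 * sigma ** 2))


# Three points forming a 3-4-5 right triangle, so the average distance is 4.
X = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
gamma, sigmas = gaussian_width_grid(X)
k_self = gaussian_kernel(X[0], X[0], sigmas[2])
```

Each candidate width would then be scored by five-fold cross-validation, as described above.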

Table 2 shows the results on the UCI data sets with 5% labeled examples. As can be seen,
WellSVM obtains highly competitive performance with the other methods, and achieves
the best improvement against SVM in terms of both the counts of (#wins - #losses) as well
as average accuracy. The Friedman test (Demšar, 2006) shows that both WellSVM and
USVM perform significantly better than SVM at the 90% confidence level, while TSVM
and LapSVM do not.
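
The per-data-set rankings that feed into the Friedman test assign rank 1 to the most accurate method and give tied methods the average of the ranks they span. A minimal sketch (the function name is hypothetical), applied to the Echocardiogram accuracies with 5% labeled examples in the order SVM, TSVM, LapSVM, USVM, WellSVM:

```python
def ranks_descending(scores):
    """Rank 1 = highest score; tied scores share the average of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    return ranks


echocardiogram = [0.80, 0.74, 0.64, 0.81, 0.80]
ranks = ranks_descending(echocardiogram)
```

Averaging such ranks over all data sets gives the "avg. rank" row of the accuracy tables.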

As can be seen, there are cases where unlabeled data cannot help for TSVM, USVM and
WellSVM. Besides the local minimum problem, another possible reason may be that the
labeled examples are too few to provide reliable model selection. Moreover, overall, LapSVM
cannot obtain good performance, which may be because the manifold assumption does
not hold on these data (Chapelle et al., 2006b).

Tables 3 and 4 show the results on the UCI data sets with 10% and 15% labeled examples,
respectively. As can be seen, as the number of labeled examples increases, SVM gets much
better performance. As a result, both TSVM and USVM cannot beat the SVM. On the
other hand, the Friedman test shows that WellSVM still performs significantly better than
SVM with 10% labeled examples at the 90% confidence level. With 15% labeled examples,
no S³VM performs significantly better than SVM.



6. http://svmlight.joachims.org/
7. http://manifold.cs.uchicago.edu/manifold_regularization/software.html
8. http://mloss.org/software/view/19/
Table 2: Accuracies on the various data sets with 5% labeled examples. The best performance
on each data set is bolded. The win/tie/loss counts (paired t-test at 95%
significance level) are listed. The method with the largest number of (#wins -
#losses) against SVM as well as the best average accuracy is also highlighted.
Number in parentheses denotes the ranking (computed as in (Demšar, 2006)) of
each method on the data set.

Data               SVM                 TSVM                LapSVM              USVM                WellSVM
Echocardiogram     0.80 ± 0.07 (2.5)   0.74 ± 0.08 (4)     0.64 ± 0.22 (5)     0.81 ± 0.06 (1)     0.80 ± 0.07 (2.5)
House              0.90 ± 0.04 (3)     0.90 ± 0.05 (3)     0.90 ± 0.04 (3)     0.90 ± 0.03 (3)     0.90 ± 0.04 (3)
Heart              0.70 ± 0.08 (5)     0.75 ± 0.08 (3)     0.73 ± 0.09 (4)     0.76 ± 0.07 (2)     0.77 ± 0.08 (1)
Heart-statlog      0.73 ± 0.10 (4.5)   0.75 ± 0.10 (1.5)   0.74 ± 0.11 (3)     0.75 ± 0.12 (1.5)   0.73 ± 0.12 (4.5)
Haberman           0.65 ± 0.07 (3)     0.61 ± 0.06 (4)     0.57 ± 0.11 (5)     0.75 ± 0.05 (1.5)   0.75 ± 0.05 (1.5)
LiverDisorders     0.56 ± 0.05 (2)     0.55 ± 0.05 (3.5)   0.55 ± 0.05 (3.5)   0.59 ± 0.05 (1)     0.53 ± 0.07 (5)
Spectf             0.73 ± 0.05 (2)     0.68 ± 0.10 (4)     0.61 ± 0.08 (5)     0.74 ± 0.05 (1)     0.70 ± 0.07 (3)
Ionosphere         0.67 ± 0.06 (4)     0.82 ± 0.11 (1)     0.65 ± 0.05 (5)     0.77 ± 0.07 (2)     0.70 ± 0.08 (3)
House-votes        0.88 ± 0.03 (3)     0.89 ± 0.05 (1.5)   0.87 ± 0.03 (4)     0.83 ± 0.03 (5)     0.89 ± 0.03 (1.5)
Clean1             0.58 ± 0.06 (4)     0.60 ± 0.08 (3)     0.54 ± 0.05 (5)     0.65 ± 0.05 (1)     0.63 ± 0.07 (2)
Isolet             0.97 ± 0.02 (3)     0.99 ± 0.01 (1)     0.97 ± 0.02 (3)     0.70 ± 0.09 (5)     0.97 ± 0.02 (3)
Australian         0.79 ± 0.05 (4)     0.82 ± 0.07 (1)     0.78 ± 0.08 (5)     0.80 ± 0.05 (3)     0.81 ± 0.04 (2)
Diabetes           0.67 ± 0.04 (4)     0.67 ± 0.04 (4)     0.67 ± 0.04 (4)     0.70 ± 0.03 (1)     0.69 ± 0.03 (2)
German             0.70 ± 0.03 (2)     0.69 ± 0.03 (4)     0.62 ± 0.05 (5)     0.70 ± 0.02 (2)     0.70 ± 0.02 (2)
Krvskp             0.91 ± 0.02 (3.5)   0.92 ± 0.03 (1.5)   0.80 ± 0.02 (5)     0.91 ± 0.03 (3.5)   0.92 ± 0.02 (1.5)
Sick               0.94 ± 0.01 (2)     0.89 ± 0.01 (5)     0.90 ± 0.02 (4)     0.94 ± 0.01 (2)     0.94 ± 0.01 (2)
SVM: win/tie/loss  -                   5/7/4               8/7/1               2/9/5               3/6/7
avg. acc.          0.763               0.767               0.723               0.770               0.778
avg. rank          3.2188              2.8125              4.2813              2.2188              2.4688


Figure 1 compares the average CPU time of WellSVM with the other S³VMs at different
numbers of labeled examples. As can be seen, TSVM is the slowest while USVM
is the most efficient. WellSVM is comparable to LapSVM. Figure 2 shows the objective
values of WellSVM on five representative UCI data sets. We can observe that
the number of iterations is always fewer than 25. As mentioned above, the SDP-based
S³VMs (Xu and Schuurmans, 2005; De Bie and Cristianini, 2006), in contrast, cannot converge
in 3 hours even on the smallest data set Echocardiogram. Hence, WellSVM scales
much better than these SDP-based approaches.



5.1.2 Large-Scale Experiments 

In this section, we study the scalability of the proposed WellSVM and other state-of-the-art
approaches on two large data sets, real-sim and RCV1. The real-sim data has 20,958 features
and 72,309 instances, while the RCV1 data has 47,236 features and 677,399 instances.
The linear kernel is used. The S³VMs compared in Section 5.1.1 are for general kernels and
cannot converge in 24 hours. Hence, to conduct a fair comparison, an efficient linear S³VM
solver, namely, SVMlin⁹ using deterministic annealing (Sindhwani and Keerthi, 2006), is
employed. All the parameters are determined in the same manner as in Section 5.1.1.



9. http://vikas.sindhwani.org/svmlin.html
Table 3: Accuracies on the various data sets with 10% labeled examples.

Data               SVM                 TSVM                LapSVM              USVM                WellSVM
Echocardiogram     0.81 ± 0.05 (2.5)   0.76 ± 0.12 (4)     0.69 ± 0.14 (5)     0.82 ± 0.05 (1)     0.81 ± 0.05 (2.5)
House              0.90 ± 0.04 (2.5)   0.92 ± 0.05 (1)     0.89 ± 0.04 (4)     0.83 ± 0.03 (5)     0.90 ± 0.04 (2.5)
Heart              0.76 ± 0.05 (3.5)   0.75 ± 0.05 (5)     0.76 ± 0.06 (3.5)   0.78 ± 0.05 (1.5)   0.78 ± 0.04 (1.5)
Heart-statlog      0.79 ± 0.03 (4)     0.74 ± 0.05 (5)     0.80 ± 0.04 (2.5)   0.80 ± 0.04 (2.5)   0.81 ± 0.04 (1)
Haberman           0.75 ± 0.04 (2)     0.60 ± 0.07 (4.5)   0.60 ± 0.07 (4.5)   0.75 ± 0.04 (2)     0.75 ± 0.04 (2)
LiverDisorders     0.59 ± 0.06 (1)     [illegible]         [illegible]         [illegible]         [illegible]
Spectf             0.74 ± 0.05 (2)     0.76 ± 0.06 (1)     0.64 ± 0.06 (5)     0.72 ± 0.06 (3.5)   0.72 ± 0.07 (3.5)
Ionosphere         0.78 ± 0.07 (4)     0.90 ± 0.04 (1)     0.66 ± 0.06 (5)     0.88 ± 0.05 (2)     0.82 ± 0.05 (3)
House-votes        0.92 ± 0.03 (1.5)   0.91 ± 0.03 (3.5)   0.88 ± 0.04 (5)     0.91 ± 0.03 (3.5)   0.92 ± 0.03 (1.5)
Clean1             0.69 ± 0.05 (3.5)   0.71 ± 0.05 (2)     0.63 ± 0.07 (5)     0.72 ± 0.05 (1)     0.69 ± 0.04 (3.5)
Isolet             0.99 ± 0.01 (2.5)   1.00 ± 0.01 (1)     0.96 ± 0.02 (4)     0.52 ± 0.03 (5)     0.99 ± 0.01 (2.5)
Australian         0.81 ± 0.03 (5)     0.84 ± 0.03 (1.5)   0.82 ± 0.04 (4)     0.84 ± 0.03 (1.5)   0.83 ± 0.03 (3)
Diabetes           0.70 ± 0.03 (4.5)   0.70 ± 0.05 (4.5)   0.71 ± 0.04 (3)     0.72 ± 0.03 (2)     0.74 ± 0.03 (1)
German             0.67 ± 0.03 (3.5)   0.67 ± 0.03 (3.5)   0.66 ± 0.04 (5)     0.70 ± 0.02 (1.5)   0.70 ± 0.02 (1.5)
Krvskp             0.93 ± 0.01 (3)     0.93 ± 0.01 (3)     0.86 ± 0.04 (5)     0.93 ± 0.01 (3)     0.94 ± 0.01 (1)
Sick               0.93 ± 0.01 (2)     0.89 ± 0.01 (5)     0.92 ± 0.01 (4)     0.93 ± 0.01 (2)     0.93 ± 0.01 (2)
SVM: win/tie/loss  -                   5/8/3               10/5/1              5/6/5               0/9/7
avg. acc.          0.799               0.789               0.753               0.774               0.807
avg. rank          2.9375              3.0000              4.2813              2.6250              2.1563



Table 4: Accuracies on various data sets with 15% labeled examples.

Data               SVM                 TSVM                LapSVM              USVM                   WellSVM
Echocardiogram     0.83 ± 0.04 (2.5)   0.76 ± 0.07 (4)     0.75 ± 0.08 (5)     0.85 ± [illegible] (1)     0.83 ± 0.04 (2.5)
House              0.92 ± 0.04 (2.5)   0.94 ± 0.04 (1)     0.83 ± 0.11 (5)     0.91 ± 0.04 (4)        0.92 ± 0.03 (2.5)
Heart              0.78 ± 0.06 (3)     0.78 ± 0.05 (3)     0.79 ± 0.05 (1)     0.78 ± 0.07 (3)        0.78 ± 0.06 (3)
Heart-statlog      0.76 ± 0.06 (2)     0.74 ± 0.06 (4)     0.79 ± 0.05 (1)     0.73 ± 0.07 (5)        0.75 ± 0.06 (3)
Haberman           0.72 ± 0.03 (3)     0.62 ± 0.07 (5)     0.63 ± 0.11 (4)     0.74 ± [illegible] (1.5)   0.74 ± [illegible] (1.5)
LiverDisorders     0.61 ± 0.05 (1)     0.54 ± 0.06 (4)     0.53 ± 0.07 (5)     0.58 ± [illegible] (2)     0.56 ± 0.06 (3)
Spectf             0.77 ± 0.03 (2)     0.79 ± 0.04 (1)     0.6 ± 0.1 (5)       0.74 ± [illegible] (4)     0.75 ± 0.06 (3)
Ionosphere         0.76 ± 0.04 (5)     0.9 ± 0.04 (1)      0.83 ± 0.04 (4)     0.89 ± 0.04 (2)        0.84 ± 0.03 (3)
House-votes        0.92 ± 0.02 (1.5)   0.92 ± 0.03 (1.5)   0.9 ± 0.03 (3)      0.83 ± 0.03 (5)        0.89 ± 0.02 (4)
Clean1             0.71 ± 0.04 (4)     0.74 ± 0.04 (2)     0.63 ± 0.07 (5)     0.76 ± 0.06 (1)        0.72 ± 0.04 (3)
Isolet             0.98 ± 0.01 (3.5)   0.99 ± 0.01 (1.5)   0.98 ± 0.01 (3.5)   0.54 ± 0.02 (5)        0.99 ± 0.01 (1.5)
Australian         0.86 ± 0.02 (1.5)   0.85 ± 0.03 (3)     0.83 ± 0.02 (4.5)   0.83 ± 0.03 (4.5)      0.86 ± 0.03 (1.5)
Diabetes           0.75 ± 0.03 (1.5)   0.73 ± 0.02 (3.5)   0.73 ± 0.03 (3.5)   0.72 ± 0.04 (5)        0.75 ± 0.03 (1.5)
German             0.71 ± 0.01 (2)     0.7 ± 0.03 (3.5)    0.68 ± 0.04 (5)     0.7 ± 0.04 (3.5)       0.72 ± 0.01 (1)
Krvskp             0.95 ± 0.01 (1.5)   0.93 ± 0.01 (4)     0.91 ± 0.01 (5)     0.94 ± 0.01 (3)        0.95 ± 0.01 (1.5)
Sick               0.94 ± [illegible] (2)   0.9 ± 0.01 (4.5)    0.9 ± 0.12 (4.5)    0.94 ± [illegible] (2)     0.94 ± [illegible] (2)
SVM: win/tie/loss  -                   8/3/5               11/2/3              6/6/4                  2/9/5
avg. acc.          0.809               0.801               0.771               0.780                  0.811
avg. rank          2.4063              2.9063              4.0000              3.2188                 2.3438



In the first experiment, we study the performance at different numbers of unlabeled
examples. Specifically, 1%, 2%, 5%, 15%, 35%, 55% and 75% of the data (with 50 of them
labeled) are used for training, and 25% of the data are for testing. This is repeated 10 times
and the average performance is reported.

Figure 2: Number of WellSVM iterations on the UCI data sets.

Figure 3 shows the results. As can be seen, WellSVM is always superior to SVMlin,
and achieves highly competitive or even better accuracy than the SVM as the number of
unlabeled examples increases. Moreover, WellSVM is much faster than SVMlin. As the
number of unlabeled examples increases, the difference becomes more prominent. This is
mainly because SVMlin employs gradient descent while WellSVM (which is based on
LIBLINEAR (Hsieh et al., 2008)) uses coordinate descent, which is known to be one of the
fastest solvers for large-scale linear SVMs (Shai et al., 2007).

Figure 3: Semi-supervised learning results on the real-sim data with different amounts of
unlabeled examples.

Figure 4 shows the results on the larger RCV1 data set. As can be seen, WellSVM
obtains good accuracy at different numbers of unlabeled examples. More importantly,
WellSVM scales well on RCV1. For example, WellSVM takes fewer than 1000 seconds
with more than 500,000 instances. On the other hand, SVMlin cannot converge in 24 hours
when more than 5% of the examples are used for training.




Figure 4: Semi-supervised learning results on the RCV1 data with different numbers of
unlabeled examples (horizontal axis of both panels: number of unlabeled training examples, x677k).



Our next experiment studies how the performance of WellSVM changes with different
numbers of labeled examples. Following the setup in Section 5.1.1, 75% of the examples are
used for training while the rest are for testing. Different numbers (namely, 25, 50, 100, 150,
and 200) of labeled examples are randomly chosen. Since SVMlin cannot handle such a large
training set, the SVM is used instead. The above process is repeated 30 times. Table 5
shows the average testing accuracy. As can be seen, WellSVM is significantly better than
SVM in all cases. The high standard deviation of WellSVM on real-sim with 25 labeled
examples may be due to the fact that the large number of unlabeled instances leads to a
large variance in deriving a large margin classifier, whereas the number of labeled examples
is too small to reduce the variance.
Table 5: Accuracy (with standard deviations) on the real-sim and rcv1 data sets, with
different numbers of labeled examples. Results for which the performance of
WellSVM is significantly better than SVM are in bold.

# of labeled examples     25             50             100            150            200
real-sim   SVM            0.78 ± 0.03    0.81 ± 0.02    0.84 ± 0.02    0.86 ± 0.01    0.88 ± 0.01
           WellSVM        0.81 ± 0.08    0.84 ± 0.02    0.89 ± 0.01    0.9 ± 0.01     0.91 ± 0.01
rcv1       SVM            0.77 ± 0.03    0.83 ± 0.01    0.87 ± 0.01    0.89 ± 0.01    0.9 ± 0.01
           WellSVM        0.83 ± 0.03    0.9 ± 0.02     0.91 ± 0.01    0.92 ± 0.01    0.93 ± 0.01



5.1.3 Comparison with Other Benchmarks in the Literature 

In this section, we further compare the proposed WellSVM with other published results in
the literature. First, we experiment on the benchmark data sets in (Chapelle et al., 2006b)
by using their same setup. Results on the average test error are shown in Table 6. As can
be seen, WellSVM is highly competitive.



Table 6: Test errors (%) on the SSL benchmark data sets (using 10 labeled examples)
in (Chapelle et al., 2006b). The SVM and TSVM results are from their Table
21.9.

           g241c    g241d    Digit1   USPS     COIL     BCI      Text
SVM        47.32    46.66    30.60    20.03    68.36    49.85    45.37
TSVM       24.71    50.08    17.77    25.20    67.50    49.15    40.37
WellSVM    37.37    43.33    16.94    22.74    70.73    48.50    33.70



Next, we comp are WellSVM with the SVM and other state-of-the-art S^VMs reported 
in (jChapelle et alLbonsl V These include 



1. ∇S3VM (Chapelle and Zien, 2005), which minimizes the S3VM objective by gradient 
descent; 



2. Continuation S3VM (cS3VM) (Chapelle et al., 2006a), which first relaxes the S3VM 
objective to a continuous function and then employs gradient descent; 

3. USVM (Collobert et al., 2006); 

4. TSVM (Joachims, 1999); 

5. Deterministic annealing S3VM with gradient minimization (∇DA) (Sindhwani et al., 
2006), which is based on the global optimization heuristic of deterministic annealing; 

6. Newton S3VM (Newton) (Chapelle, 2007), which uses the second-order Newton's 
method; and 

7. Branch-and-bound (BB) (Chapelle et al., 2007). 



Results are shown in Table 7. As can be seen, BB attains the best performance. Overall, 
WellSVM performs slightly worse than ∇DA, but is highly competitive compared with 
the other S3VM variants. 



Table 7: Test errors (%) of the WellSVM and various S3VM variants. Results of the 
S3VMs compared are from Table 11 in (Chapelle et al., 2008). BB can only 
be run on the 2moons data set due to its high computational cost. Note that 
in (Chapelle et al., 2008), USVM is called CCCP and TSVM is called S3VMlight. 

          SVM    ∇S3VM   cS3VM   USVM   TSVM   ∇DA    Newton   BB    WellSVM
2moons    35.6   65.0    49.8    66.3   68.7   30.0   33.5     0.0   33.5
g50c      8.2    8.3     8.3     8.5    8.4    6.7    7.5      -     7.6
text      14.8   5.7     5.8     8.5    8.1    6.5    14.5     -     8.7
uspst     20.7   14.1    15.6    14.9   14.5   11.0   19.2     -     14.3
coil20    32.7   23.9    23.6    23.6   21.8   18.9   24.6     -     23.0


Finally, we compare WellSVM with MMC (Xu et al., 2005), an SDP-based S3VM, on 
the data sets used there. Table 8 shows the results. Again, WellSVM is highly competitive. 



Table 8: Test errors (%) of WellSVM and MMC (an SDP-based S3VM) on the data sets 
used in (Xu et al., 2005). The MMC results are copied from their Table 2. 

           HWD 1-7   HWD 2-3   Australian   Flare   Vote   Diabetes
MMC        3.2       4.7       32.0         34.0    14.0   35.6
WellSVM    2.7       5.3       40.0         28.9    11.6   41.3


5.2 Multi-Instance Learning for Locating ROIs 

In this section, we evaluate the proposed method on multi-instance learning, with 
application to ROI location in CBIR image data. We employ the image database in (Zhou et al., 
2005), which consists of 500 COREL images from five image categories: castle, firework, 
mountain, sunset and waterfall. Each image is of size 160 × 160, and is converted to the 
multi-instance feature representation by the bag generator SBN (Maron and Ratan, 1998). 
Each region (instance) in the image (bag) is of size 20 × 20. Some of these regions are 
labeled manually as ROIs. A summary of the data set is shown in Table 9. It is very 
labor-intensive to collect large image data with all the regions labeled. Hence, we leave the 
experiments on large-scale data sets as a future direction. 

The one-vs-rest strategy is used. Specifically, a training set of 50 images is created by 
randomly sampling 10 images from each of the five categories. The remaining 450 images 
constitute the test set. This training/test split is randomly generated 10 times, and the 
average performance is reported. 

Table 9: Some statistics of the image data set. 

concept     #images   average #ROIs per image
castle      100       19.39
firework    100       27.23
mountain    100       24.93
sunset      100       2.32
waterfall   100       13.89
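
The sampling protocol above can be sketched as follows. This is only an illustrative sketch, not the authors' code: the function name `make_split`, the integer image ids and the use of one seed per repetition are assumptions made here.

```python
import random

CATEGORIES = ["castle", "firework", "mountain", "sunset", "waterfall"]

def make_split(images_by_category, n_train_per_category=10, rng=None):
    """Sample n_train_per_category images per category for training; the rest form the test set."""
    rng = rng or random.Random()
    train, test = [], []
    for cat in CATEGORIES:
        ids = list(images_by_category[cat])
        rng.shuffle(ids)
        train += [(cat, i) for i in ids[:n_train_per_category]]
        test += [(cat, i) for i in ids[n_train_per_category:]]
    return train, test

# Hypothetical image ids: 100 images per category, 500 in total.
images = {cat: range(100) for cat in CATEGORIES}

# Ten independent random splits; performance would be averaged over them.
splits = [make_split(images, rng=random.Random(seed)) for seed in range(10)]
```

Each split contains 50 training and 450 test images, matching the counts stated in the text.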

Although many multi-instance methods have been proposed, they mainly focus on 
improving the classification performance, whereas only some of them are used to identify 
the ROIs. We compare with these state-of-the-art methods (Andrews et al., 2003; Maron and 
Ratan, 1998; Zhang and Goldman, 2002; Zhou et al., 2005) as well as related SVM-based 
methods in the experiments. Specifically, the WellSVM is compared with the following 
SVM variants: 1) MI-SVM (Andrews et al., 2003); 2) mi-SVM (Andrews et al., 2003); and 
3) SVM with multi-instance kernel (MI-Kernel) (Gartner et al., 2002). The Gaussian kernel 
is used for all the SVMs, where its width $\sigma$ is picked from 
$\{0.25\sqrt{\gamma}, 0.5\sqrt{\gamma}, \sqrt{\gamma}, 2\sqrt{\gamma}, 4\sqrt{\gamma}\}$ 
with $\gamma$ being the average distance between instances; $C_1$ is picked from 
$\{C_2, 4C_2, 10C_2\}$; and $C_2$ is from $\{1, 10, 100\}$. We also compare with three 
state-of-the-art non-SVM-based methods that can locate ROIs, namely, Diverse Density (DD) 
(Maron and Ratan, 1998), EM-DD (Zhang and Goldman, 2002) and CkNN-ROI (Zhou et al., 2005). 
All the parameters are selected by ten-fold cross-validation (except for CkNN-ROI, whose 
parameters are based on the best setting reported in (Zhou et al., 2005)). 

In each image classified as relevant by the algorithm, the image region with the maximum 
prediction value is taken as its ROI.$^{10}$ The following two measures are used in evaluating 
the performance of ROI location. 



1. The success rate of relevant images: 

\[ \text{success rate of relevant images} = \frac{\text{number of ROI successes}}{\text{number of relevant images}}. \tag{39} \]

Here, for each image predicted as relevant by the algorithm, the ROI returned by the 
algorithm is counted as a success if it is a real ROI. 

2. The ROI success rate computed based on those images that are predicted as relevant, 
i.e., 

\[ \text{success rate of ROIs} = \frac{\text{number of ROI successes}}{\text{number of images predicted as relevant}}. \tag{40} \]


10. Alternatively, if we allow an algorithm to output multiple ROIs for an image, a heuristic thresholding 
of the prediction values will be needed. For simplicity, we defer such a setup as future work. 

Table 10: Success rate in locating the ROIs. The best performance and those which are 
comparable to the best performance (paired t-test at 95% significance level) on 
each data set are bolded. 

                  method      castle        firework      mountain      sunset        waterfall
SVM methods       WellSVM     0.57 ± 0.12   0.68 ± 0.17   0.59 ± 0.10   0.32 ± 0.07   0.39 ± 0.13
                  mi-SVM      0.51 ± 0.04   0.56 ± 0.07   0.18 ± 0.09   0.32 ± 0.01   0.37 ± 0.08
                  MI-SVM      0.52 ± 0.22   0.63 ± 0.26   0.18 ± 0.13   0.29 ± 0.10   0.06 ± 0.02
                  MI-Kernel   0.56 ± 0.08   0.57 ± 0.11   0.23 ± 0.20   0.24 ± 0.03   0.20 ± 0.11
non-SVM methods   DD          0.24 ± 0.16   0.15 ± 0.28   0.56 ± 0.11   0.30 ± 0.18   0.26 ± 0.24
                  EM-DD       0.69 ± 0.06   0.65 ± 0.24   0.54 ± 0.18   0.36 ± 0.15   0.30 ± 0.12
                  CkNN-ROI    0.48 ± 0.05   0.65 ± 0.09   0.47 ± 0.06   0.31 ± 0.04   0.20 ± 0.05

Notice that there is a tradeoff between these two measures. When an algorithm classifies 
many images as relevant, the success rate of relevant images (Eq. (39)) is high while the 
success rate of ROIs (Eq. (40)) can be low, since there are many relevant images predicted 
by the algorithm. On the other hand, when an algorithm classifies many images as irrelevant, 
the success rate of ROIs is high while the success rate of relevant images is low since many 
relevant images are missed. To balance these two goals, we introduce a novel success 
rate of ROIs: 

\[ \text{success rate} = \frac{2\,\#\text{ROI successes}}{\#\text{relevant images} + \#\text{predicted relevant images}}. \]

This is similar to the F-score in information retrieval, as 

\[ \frac{1}{\text{success rate}} = \frac{\#\text{relevant images} + \#\text{predicted relevant images}}{2\,\#\text{ROI successes}} = \frac{1}{2}\left( \frac{1}{\frac{\#\text{ROI successes}}{\#\text{relevant images}}} + \frac{1}{\frac{\#\text{ROI successes}}{\#\text{predicted relevant images}}} \right). \]

Intuitively, when an algorithm correctly recognizes all the relevant images and their ROIs, 
the success rate will be high. 
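
As a small illustration of how the three success rates relate, the sketch below computes them from raw counts and checks numerically that the combined rate equals the harmonic mean of the other two. The function name and the example counts are hypothetical and are not taken from the experiments.

```python
def roi_success_rates(n_successes, n_relevant, n_predicted_relevant):
    """Return (success rate of relevant images, success rate of ROIs, combined success rate)."""
    rate_relevant = n_successes / n_relevant if n_relevant else 0.0
    rate_rois = n_successes / n_predicted_relevant if n_predicted_relevant else 0.0
    denom = n_relevant + n_predicted_relevant
    combined = 2 * n_successes / denom if denom else 0.0
    return rate_relevant, rate_rois, combined

# Hypothetical counts: 50 relevant images, 40 images predicted as relevant,
# and 30 of the returned ROIs are real ROIs.
r_rel, r_roi, combined = roi_success_rates(30, 50, 40)

# Harmonic mean of the two individual rates (the F-score analogy).
harmonic = 2 / (1 / r_rel + 1 / r_roi)
```

With these counts the two individual rates are 0.6 and 0.75, and both `combined` and `harmonic` evaluate to 2/3.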

Table 10 shows the success rates (with standard deviations) of the various methods. As 
can be seen, WellSVM achieves the best performance among all the SVM-based methods. 
Compared with the non-SVM methods, WellSVM is 
always better than DD and CkNN-ROI, and is highly comparable to EM-DD. In particular, 
EM-DD achieves the best performance on castle and sunset, while WellSVM achieves 
the best performance on the remaining three categories (firework, mountain and waterfall). 
Figure 5 shows some example images with the located ROIs. It can be observed that 
WellSVM can correctly identify more ROIs than the other SVM-based methods. 

In the following experiment, instead of reporting one ROI in each image, we observe 
the number of ROI successes when different numbers of most-confident bags are reported. 
As can be seen from Figure 6, the proposed WellSVM still achieves highly competitive 
performance. 

Figure 5: ROIs located by (from left to right) DD, EM-DD, CkNN-ROI, MI-SVM, mi-SVM, 
MI-Kernel, and WellSVM. Each row shows one category (top to bottom: 
firework, sunset, waterfall, castle and mountain). 

[Figure 6 plots; recovered panel titles: castle, firework, mountain] 

Figure 6: Number of ROI successes when different numbers of most-confident bags are 
reported. 


5.3 Clustering 

In this section, we further evaluate our WellSVM on clustering problems where all the 
labels are unknown. As in semi-supervised learning, 16 UCI data sets and 2 large data sets 
are used for comparison. 

5.3.1 Small-Scale Experiments 

The WellSVM is compared with the following methods: 1) k-means clustering (KM); 
2) kernel k-means clustering (KKM); 3) normalized cut (NC) (Shi and Malik, 2000); 4) 
GMMC (Valizadegan and Jin, 2007); 5) IterSVR$^{11}$ (Zhang et al., 2007); and 6) CPMMC$^{12}$ 
(Zhao et al., 2008). In the preliminary experiment, we also compared with the original 
SDP-based approach in (Xu et al., 2005). However, similar to the experimental results 
in semi-supervised learning, it does not converge after 3 hours on the smallest data set 
echocardiogram. Hence, GMMC, which is also based on SDP but about 100 times faster 
than (Xu et al., 2005), is used in the comparison. 

For GMMC, IterSVR, CPMMC and WellSVM, the C parameter is selected from the range 
{0.1, 0.5, 1, 5, 10, 100}. For the UCI data sets, both the linear and Gaussian kernels are 
used. In particular, the width $\sigma$ of the Gaussian kernel is picked from 
$\{0.25\sqrt{\gamma}, 0.5\sqrt{\gamma}, \sqrt{\gamma}, 2\sqrt{\gamma}, 4\sqrt{\gamma}\}$, 
where $\gamma$ is the average distance between instances. The parameter of normalized 
cut is chosen from the same range as $\sigma$. Since k-means and IterSVR are susceptible to the 
problem of local minima, these two methods are run 10 times and the average performance 
is reported. We set the balance constraint in the same manner as in (Zhang et al., 2007), 
i.e., $\beta$ is set as $0.03N$ for balanced data and $0.3N$ for imbalanced data. To initialize 
WellSVM, 20 random label assignments are generated and the one with the maximum 
kernel alignment (Cristianini et al.) is chosen. We also use this to initialize KM, 
KKM and IterSVR, and the resultant variants are denoted KM-r, KKM-r and IterSVR-r, 
respectively. All the methods are reported with the best parameter setting. 
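
To make the width grid and the initialization step concrete, here is a minimal pure-Python sketch. Several details are assumptions rather than facts from the paper: the Gaussian kernel is taken to be exp(-||x - z||^2 / (2 sigma^2)), the alignment is the standard kernel-target alignment between K and yy^T, the balance constraint is ignored, and all function names and the toy data are hypothetical.

```python
import math
import random

def average_distance(X):
    """Mean Euclidean distance over all unordered pairs of instances (gamma in the text)."""
    n = len(X)
    total = sum(math.dist(X[i], X[j]) for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)

def width_grid(X):
    """Candidate widths {0.25, 0.5, 1, 2, 4} times sqrt(gamma)."""
    root = math.sqrt(average_distance(X))
    return [m * root for m in (0.25, 0.5, 1.0, 2.0, 4.0)]

def gaussian_kernel(X, sigma):
    """Kernel matrix under the assumed form k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return [[math.exp(-math.dist(a, b) ** 2 / (2.0 * sigma ** 2)) for b in X] for a in X]

def alignment(K, y):
    """<K, yy^T>_F / (||K||_F * ||yy^T||_F); ||yy^T||_F equals n for labels in {-1, +1}."""
    n = len(y)
    inner = sum(y[i] * K[i][j] * y[j] for i in range(n) for j in range(n))
    frob = math.sqrt(sum(v * v for row in K for v in row))
    return inner / (n * frob)

def init_labels(K, n_candidates=20, seed=0):
    """Draw random label vectors and keep the one with the largest alignment."""
    rng = random.Random(seed)
    n = len(K)
    candidates = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(n_candidates)]
    return max(candidates, key=lambda y: alignment(K, y))

# Toy data: two well-separated groups of two points each.
X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
sigmas = width_grid(X)
K = gaussian_kernel(X, sigmas[2])
y0 = init_labels(K)
```

On the toy data, a labeling that follows the two groups has a higher alignment than one that mixes them, which is the intuition behind using alignment to pick the starting assignment.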



We follow the strategy in (Xu et al., 2005) to evaluate the clustering accuracy. We first 
remove the labels for all instances, and then obtain the clusters by the various clustering 
algorithms. Finally, the misclassification error is measured w.r.t. the true labels. 
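
For two clusters, this evaluation can be sketched as below. The sketch assumes that predictions and true labels use the same two values and that cluster identities are arbitrary, so the better of the two possible cluster-to-class matchings is taken; this matching rule is an assumption here and may differ in detail from the cited protocol.

```python
def clustering_accuracy(pred, truth):
    """Accuracy of a two-cluster assignment, up to swapping the two cluster labels."""
    agree = sum(p == t for p, t in zip(pred, truth)) / len(truth)
    return max(agree, 1.0 - agree)

# Hypothetical examples with labels in {-1, +1}.
acc_swapped = clustering_accuracy([1, 1, -1, -1], [-1, -1, 1, 1])
acc_partial = clustering_accuracy([1, 1, 1, -1], [1, 1, -1, -1])
```

The misclassification error is then one minus this accuracy.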

We first study the clustering accuracy on 16 UCI data sets that cover a wide range 
of properties. Results are shown in Table 11. As can be seen, WellSVM outperforms 
existing clustering approaches on most data sets. Specifically, WellSVM obtains the best 
performance on 13 out of 16 data sets. GMMC is not as good as WellSVM. This may be 
because the convex relaxation proposed in GMMC is not the same as that of the original 
SDP-based approach (Xu et al., 2005) and WellSVM. 

The CPU times on the UCI data sets are shown in Figure 7. As can be seen, local 
optimization methods, such as IterSVR and CPMMC, are often efficient. As for the global 
optimization methods, WellSVM scales much better than GMMC. On average, WellSVM 
is about 10 times faster. These results validate that WellSVM achieves much better 
scalability than the SDP-based GMMC approach. However, in general, convex methods are 
still slower than non-convex optimization methods on the small data sets. 



11. http://www.cse.ust.hk/~twinsen 

12. http://binzhao02.googlepages.com/ 

Table 11: Clustering accuracies on various data sets. "-" indicates that the method does 
not converge in 2 hours or out-of-memory problem occurs. 

Data             KM     KM-r   KKM    KKM-r  NC     GMMC   IterSVR  IterSVR-r  CPMMC  WellSVM
Echocardiogram   0.76   0.76   0.76   0.77   0.76   0.7    0.74     0.78       0.82   0.83
House            0.89   0.89   0.89   0.88   0.89   0.78   0.87     0.87       0.53   0.93
Heart            0.66   0.59   0.69   0.59   0.57   0.7    0.59     0.59       0.56   0.74
Heart-statlog    0.68   0.79   0.78   0.79   0.79   0.77   0.76     0.76       0.56   0.81
Haberman         0.6    0.59   0.69   0.64   0.7    0.6    0.62     0.57       0.74   0.74
LiverDisorders   0.55   0.54   0.56   0.56   0.57   0.55   0.53     0.51       0.58   0.58
Spectf           0.58   0.57   0.77   0.77   0.63   0.64   0.53     0.53       0.73   0.73
Ionosphere       0.7    0.71   0.73   0.74   0.7    0.73   0.71     0.65       0.64   0.77
House-votes      0.87   0.87   0.87   0.87   0.86   0.6    0.83     0.82       0.61   0.88
Clean1           0.54   0.54   0.59   0.62   0.52   0.66   0.61     0.53       0.56   0.56
Isolet           0.98   0.96   0.89   0.95   0.98   0.56   1.00     1.00       0.5    1.00
Australian       0.54   0.55   0.57   0.57   0.56   0.6    0.56     0.51       0.56   0.82
Diabetes         0.67   0.67   0.69   0.69   0.66   0.69   0.66     0.66       0.65   0.68
German           0.57   0.56   0.68   0.62   0.66   0.56   0.56     0.64       0.7    0.7
Krvskp           0.52   0.51   0.55   0.55   0.56   -      0.51     0.51       0.52   0.57
Sick             0.68   0.63   0.88   0.77   0.84   -      0.63     0.59       0.94   0.94

Figure 7: CPU time (in seconds) on the UCI data sets. 

5.3.2 Large-Scale Experiments 

In this section, we further evaluate the scalability of WellSVM on large data sets when 
the linear kernel is used. In this case, the WellSVM only involves solving a sequence of 
linear SVMs. As packages specially designed for the linear SVM (such as LIBLINEAR) are 
much more efficient than those designed for general kernels (such as LIBSVM), it can be 
expected that the linear WellSVM is also scalable on large data sets. 

The real-sim data contains 72,309 instances and has 20,958 features. To study the effect 
of sample size on performance, different sampling rates (1%, 2%, 5%, 10%, 20%, ..., 100%) 
are considered. For each sampling rate (except for 100%), we perform random sampling 5 
times, and report the average performance. Since k-means depends on random 
initialization, we run it 10 times for each sampling rate, and report its average accuracy. Figure 8 
shows the accuracy and running time.$^{13}$ As can be seen, WellSVM outperforms k-means 
and can be used on large data sets. 

13. k-means is implemented in MATLAB, and so its running time is not compared with WellSVM, whose 
core procedure is implemented in C++. 

Figure 8: Clustering results on the real-sim data with different numbers of examples. 

The RCV1 data is very high-dimensional and contains more than 677,000 instances. 
Following the same setup as for the real-sim data, WellSVM is compared with k-means 
under different sampling rates. Figure 9 shows the results. Note that k-means does not 
converge in 24 hours when more than 20% of the training instances are used. As can be seen, 
WellSVM obtains better performance than k-means and scales quite well on 
RCV1. It takes fewer than 1,000 seconds for RCV1 with more than 677,000 instances and 
40,000 dimensions. 

Figure 9: Clustering results on the RCV1 data with different numbers of examples. 



6. Conclusion 

Learning from weakly labeled data, where the training labels are incomplete, is generally 
regarded as a crucial yet challenging machine learning task. However, the underlying 
mixed-integer programming problem limits its scalability and accuracy. To 
alleviate these difficulties, we proposed a convex WellSVM based on a novel "label 
generation" strategy. It can be shown that WellSVM is at least as tight as existing SDP 
relaxations, but is much more scalable. Moreover, since it can be reduced to a sequence of 
standard SVM training problems, it can directly benefit from advances in the development of 
efficient SVM software. 

In contrast to traditional approaches that are tailored for a specific weak-label learning 
problem, our WellSVM formulation can be used on a general class of weak-label learning 
problems. Specifically, three common weak-label learning tasks, namely 
(i) semi-supervised learning where labels are partially known; (ii) multi-instance learning 
where labels are implicitly known; and (iii) clustering where labels are totally unknown, 
can all be put under the same formulation. Experimental results show that the WellSVM 
obtains good performance and is readily scalable on large data sets. We believe that similar 
conclusions can be reached on other weak-label learning tasks, such as the noise-tolerant 
problem (Angluin and Laird, 1988). 

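The focus of this paper is on binary weakly labeled problems. Multi-class weakly 
labeled problems can be easily handled by decomposing them into multiple binary 
problems (Crammer and Singer, 2002). However, one exception is clustering problems, in which 
existing decomposition methods cannot be applied as there are no labels. Extension to this 
more challenging multi-class clustering scenario will be considered as future work. 

Appendix A. Proof of Theorem 2 

Proof Let $\{\alpha^{(t)}, \mu^{(t)}\}$ be the optimal solution of Eq. (11), which can be viewed as a 
saddle-point problem. Let $J(\alpha, \mu) = \sum_{y \in \mathcal{C}^{(t)}} \mu_y g_y(\alpha)$. Using the saddle-point 
property (Boyd and Vandenberghe, 2004), we have 

\[ J(\alpha, \mu^{(t)}) \ge J(\alpha^{(t)}, \mu^{(t)}) \ge J(\alpha^{(t)}, \mu), \quad \forall \alpha, \mu. \]

In other words, $\alpha^{(t)}$ minimizes $J(\alpha, \mu^{(t)})$. Note that $g_y(\alpha)$ is $\lambda$-strongly convex and 
$\sum_{y \in \mathcal{C}^{(t)}} \mu_y^{(t)} = 1$, thus $J(\alpha, \mu^{(t)})$ is also $\lambda$-strongly convex. Using the Taylor expansion, 
we have 

\[ J(\alpha, \mu^{(t)}) - J(\alpha^{(t)}, \mu^{(t)}) \ge \frac{\lambda}{2} \|\alpha - \alpha^{(t)}\|^2, \quad \forall \alpha \in \mathcal{A}. \]

Using the definition of $J(\alpha, \mu)$, we then have 

\[ \sum_{y \in \mathcal{C}^{(t)}} \mu_y^{(t)} g_y(\alpha) \ge -p^{(t)} + \frac{\lambda}{2} \|\alpha - \alpha^{(t)}\|^2. \tag{41} \]

Let $y^{(t+1)}$ be the violated label vector selected at iteration $t + 1$ in Algorithm 1, i.e., 
$\mathcal{C}^{(t+1)} = \mathcal{C}^{(t)} \cup y^{(t+1)}$. From the definition, we have 

\[ g_{y^{(t+1)}}(\alpha^{(t)}) = -G(\alpha^{(t)}, y^{(t+1)}) \ge \max_{y \in \mathcal{C}^{(t)}} -G(\alpha^{(t)}, y) + \epsilon = \max_{y \in \mathcal{C}^{(t)}} g_y(\alpha^{(t)}) + \epsilon \ge \sum_{y \in \mathcal{C}^{(t)}} \mu_y^{(t)} g_y(\alpha^{(t)}) + \epsilon = -p^{(t)} + \epsilon. \tag{42} \]

Consider the following optimization problem and let $\tilde{p}^{(t+1)}$ be its optimal objective value: 

\[ \tilde{p}^{(t+1)} = - \min_{\alpha \in \mathcal{A}} \max_{0 \le \theta \le 1} \; \theta \sum_{y \in \mathcal{C}^{(t)}} \mu_y^{(t)} g_y(\alpha) + (1 - \theta) g_{y^{(t+1)}}(\alpha). \tag{43} \]

When $\theta = 1$, it reduces to Eq. (11) at iteration $t$, and so $\tilde{p}^{(t+1)} \le p^{(t)}$. On the other 
hand, note that $\theta \sum_{y \in \mathcal{C}^{(t)}} \mu_y^{(t)} + (1 - \theta) = \theta + (1 - \theta) = 1$, so the optimal solution in Eq. (43) 
is suboptimal to that of Eq. (11) at iteration $t + 1$. Then we have $p^{(t+1)} \le \tilde{p}^{(t+1)}$. Let 
$\eta = p^{(t)} - \tilde{p}^{(t+1)}$; we now aim at showing $\eta \ge \left( \frac{-c + \sqrt{c^2 + 4\epsilon}}{2} \right)^2$, which induces our 
final inequality Eq. (12). 

Let $\{\tilde{\alpha}^{(t)}, \tilde{\theta}\}$ be the optimal solution of Eq. (43). We have the following inequalities: 

\[ p^{(t)} - \eta \le - \sum_{y \in \mathcal{C}^{(t)}} \mu_y^{(t)} g_y(\tilde{\alpha}^{(t)}), \tag{44} \]

\[ p^{(t)} - \eta \le - g_{y^{(t+1)}}(\tilde{\alpha}^{(t)}). \tag{45} \]

Using Eqs. (41), (42), (44) and (45), we have 

\[ \eta \ge \sum_{y \in \mathcal{C}^{(t)}} \mu_y^{(t)} g_y(\tilde{\alpha}^{(t)}) + p^{(t)} \ge \frac{\lambda}{2} \|\tilde{\alpha}^{(t)} - \alpha^{(t)}\|^2, \tag{46} \]

\[ \epsilon - \eta \le g_{y^{(t+1)}}(\alpha^{(t)}) - g_{y^{(t+1)}}(\tilde{\alpha}^{(t)}) \le M \|\tilde{\alpha}^{(t)} - \alpha^{(t)}\|. \tag{47} \]

On combining Eqs. (46) and (47), we obtain 

\[ \epsilon - \eta \le M \sqrt{\frac{2\eta}{\lambda}}, \]

and then finally we have $\eta \ge \left( \frac{-c + \sqrt{c^2 + 4\epsilon}}{2} \right)^2$, where $c = M \sqrt{2/\lambda}$. ∎ 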
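
The last step can be unpacked as follows. This is a sketch of the elementary algebra only; the substitution $s = \sqrt{\eta}$ is introduced here for illustration and is not part of the original proof, and it assumes $\epsilon > 0$ and $c = M\sqrt{2/\lambda}$ as reconstructed from the combined inequality.

```latex
\begin{align*}
\epsilon - \eta \le M\sqrt{\frac{2\eta}{\lambda}} = c\sqrt{\eta}
  &\iff s^2 + c\,s - \epsilon \ge 0, \qquad s = \sqrt{\eta} \ge 0, \\
  &\iff s \ge \frac{-c + \sqrt{c^2 + 4\epsilon}}{2}
  \quad \text{(the other root is negative, since } \epsilon > 0\text{)}, \\
  &\iff \eta \ge \left( \frac{-c + \sqrt{c^2 + 4\epsilon}}{2} \right)^2 .
\end{align*}
```

Since the right-hand side is strictly positive whenever $\epsilon > 0$, each added label vector decreases the objective by at least a fixed amount.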
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